1,993 research outputs found

Web Data Extraction, Applications and Techniques: A Survey

Author: Abel
Amalfitano
Balduzzi
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Berger
Berthold
Bettencourt
Califf
Catanese
Chang
Chen
Chen
Chen
Collins
Conover
Crandall
Crescenzi
Crescenzi
Dalvi
Dalvi
De Meo
De Meo
Doan
Emilio Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Flesca
Freitag
Furche
Gatterbauer
Gatterbauer
Giacomo Fiumara
Gjoka
Gkotsis
Gottlob
Gottlob
Hammersley
Han
Hecht
Hsu
Irmak
Khare
Kim
Kinsella
Kleinberg
Kleinberg
Kohlschütter
Kokkoras
Kokkoras
Kokkoras
Krüpl
Kushmerick
Kwak
Laender
Liu
Manning
Masanès
Mathes
Meng
Mislove
Monge
Muslea
Oro
Pan
Pasquale De Meo
Perito
Phan
Plake
Rahm
Rahm
Reis
Robert Baumgartner
Sahuguet
Sarawagi
Schifanella
Selkow
Shi
Soderland
Szomszor
Turmo
Vosecky
Wang
Wang
Weikum
Wilson
Winograd
Yang
Ye
Zafarani
Zanasi
Zhai
Zhang
Zhang
Publication venue: 'Elsevier BV'
Publication date: 09/06/2014

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-based System

arXiv.org e-Print Archive

Crossref

State-of-the-art of related technologies to Alfanet

Author: Arana Cristina
Ayala Antonio
Barrera Carmen
Boticario Jesús
Brouns Francis
De Croock Marcel
Gaudioso Elena
Hernández Félix
Mofers Frans
Santos Olga
Trueba Irma
Van Rosmalen Peter
Van Veen Maarten
Publication date: 25/10/2002

Open University of the Netherlands Research Portal

Extracting Interests of Users from Web Log Data Log

Author: B Swetha, K Srinivasa Rao
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 30/09/2014

The amount of information on the Web is growing rapidly. Without a recommendation system, users may spend a lot of time on the network finding the information they are interested in. Today, many web recommendation systems cannot give users adequate personalized help and instead present them with a lot of irrelevant information. One of the main reasons is that these systems cannot accurately extract users' interests. Therefore, analyzing users' Web Log Data and extracting the domains they are potentially interested in have become important and challenging research topics in web usage mining. If users' interests can be automatically detected from their Web Log Data, they can be used for information recommendation and marketing, which are useful for both users and Web site developers. In this paper, some novel algorithms are proposed to mine users' interests. The algorithms are based on visit time and visit density, which can be obtained from an analysis of users' Web Log Data. The experimental results show that the proposed methods succeed in finding the domains users are interested in.
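The abstract does not define visit time or visit density, so the following is only an illustrative sketch and not the authors' algorithm. It assumes a hypothetical log format of `(user, domain, seconds_on_page)` records, treats visit time as a domain's share of the user's total browsing time, treats visit density as a domain's share of the user's page visits, and combines the two with a hypothetical `weight` parameter.

```python
from collections import defaultdict


def interest_scores(log, weight=0.5):
    """Score each user's interest in each domain.

    log    -- iterable of (user, domain, seconds_on_page) records
              (assumed format, not taken from the paper)
    weight -- hypothetical mixing factor between time share and density
    """
    time_by = defaultdict(float)    # (user, domain) -> seconds spent
    hits_by = defaultdict(int)      # (user, domain) -> number of visits
    total_time = defaultdict(float)  # user -> seconds spent overall
    total_hits = defaultdict(int)    # user -> visits overall

    for user, domain, seconds in log:
        time_by[(user, domain)] += seconds
        hits_by[(user, domain)] += 1
        total_time[user] += seconds
        total_hits[user] += 1

    scores = defaultdict(dict)
    for (user, domain), seconds in time_by.items():
        # Assumed definition: fraction of the user's time spent in the domain.
        time_share = seconds / total_time[user] if total_time[user] else 0.0
        # Assumed definition: fraction of the user's visits made to the domain.
        density = hits_by[(user, domain)] / total_hits[user]
        scores[user][domain] = weight * time_share + (1 - weight) * density
    return {user: dict(domains) for user, domains in scores.items()}


def top_interest(scores, user):
    """Return the highest-scoring domain for a user, or None if unknown."""
    domains = scores.get(user)
    if not domains:
        return None
    return max(domains, key=domains.get)
```

Calling `interest_scores` on a parsed log and then `top_interest(scores, "u1")` gives the domain ranked highest for that user under these assumed definitions.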

International Journal on Recent and Innovation Trends in Computing and Communication

Advanced Knowledge Technologies at the Midterm: Tools and Methods for the Semantic Web

Author: Ciravegna Fabio
Domingue John
Hall Wendy
Motta Enrico
O'Hara Kieron
Robertson David
Shadbolt Nigel
Sleeman Derek
Tate Austin
Wilks Yorick
Publication venue: School of Electronics and Computer Science, University of Southampton
Publication date: 01/01/2004

The University of Edinburgh and research sponsors are authorised to reproduce and distribute reprints and on-line copies for their purposes notwithstanding any copyright annotation hereon. The views and conclusions contained herein are the author’s and shouldn’t be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of other parties.

In a celebrated essay on the new electronic media, Marshall McLuhan wrote in 1962:

Our private senses are not closed systems but are endlessly translated into each other in that experience which we call consciousness. Our extended senses, tools, technologies, through the ages, have been closed systems incapable of interplay or collective awareness. Now, in the electric age, the very instantaneous nature of co-existence among our technological instruments has created a crisis quite new in human history. Our extended faculties and senses now constitute a single field of experience which demands that they become collectively conscious. Our technologies, like our private senses, now demand an interplay and ratio that makes rational co-existence possible. As long as our technologies were as slow as the wheel or the alphabet or money, the fact that they were separate, closed systems was socially and psychically supportable. This is not true now when sight and sound and movement are simultaneous and global in extent. (McLuhan 1962, p.5, emphasis in original)

Over forty years later, the seamless interplay that McLuhan demanded between our technologies is still barely visible. McLuhan’s predictions of the spread, and increased importance, of electronic media have of course been borne out, and the worlds of business, science and knowledge storage and transfer have been revolutionised. Yet the integration of electronic systems as open systems remains in its infancy.

Advanced Knowledge Technologies (AKT) aims to address this problem, to create a view of knowledge and its management across its lifecycle, to research and create the services and technologies that such unification will require. Halfway through its six-year span, the results are beginning to come through, and this paper will explore some of the services, technologies and methodologies that have been developed. We hope to give a sense in this paper of the potential for the next three years, to discuss the insights and lessons learnt in the first phase of the project, to articulate the challenges and issues that remain.

The WWW provided the original context that made the AKT approach to knowledge management (KM) possible. AKT was initially proposed in 1999; it brought together an interdisciplinary consortium with the technological breadth and complementarity to create the conditions for a unified approach to knowledge across its lifecycle. The combination of this expertise, and the time and space afforded the consortium by the IRC structure, suggested the opportunity for a concerted effort to develop an approach to advanced knowledge technologies, based on the WWW as a basic infrastructure.

The technological context of AKT altered for the better in the short period between the development of the proposal and the beginning of the project itself with the development of the semantic web (SW), which foresaw much more intelligent manipulation and querying of knowledge. The opportunities that the SW provided for, e.g., more intelligent retrieval put AKT in the centre of information technology innovation and knowledge management services; the AKT skill set would clearly be central for the exploitation of those opportunities.

The SW, as an extension of the WWW, provides an interesting set of constraints to the knowledge management services AKT tries to provide. As a medium for the semantically-informed coordination of information, it has suggested a number of ways in which the objectives of AKT can be achieved, most obviously through the provision of knowledge management services delivered over the web as opposed to the creation and provision of technologies to manage knowledge.

AKT is working on the assumption that many web services will be developed and provided for users. The KM problem in the near future will be one of deciding which services are needed and of coordinating them. Many of these services will be largely or entirely legacies of the WWW, and so the capabilities of the services will vary. As well as providing useful KM services in their own right, AKT will be aiming to exploit this opportunity, by reasoning over services, brokering between them, and providing essential meta-services for SW knowledge service management.

Ontologies will be a crucial tool for the SW. The AKT consortium brings a lot of expertise on ontologies together, and ontologies were always going to be a key part of the strategy. All kinds of knowledge sharing and transfer activities will be mediated by ontologies, and ontology management will be an important enabling task. Different applications will need to cope with inconsistent ontologies, or with the problems that will follow the automatic creation of ontologies (e.g. merging of pre-existing ontologies to create a third). Ontology mapping, and the elimination of conflicts of reference, will be important tasks. All of these issues are discussed along with our proposed technologies.

Similarly, specifications of tasks will be used for the deployment of knowledge services over the SW, but in general it cannot be expected that in the medium term there will be standards for task (or service) specifications. The brokering meta-services that are envisaged will have to deal with this heterogeneity.

The emerging picture of the SW is one of great opportunity but it will not be a well-ordered, certain or consistent environment. It will comprise many repositories of legacy data, outdated and inconsistent stores, and requirements for common understandings across divergent formalisms. There is clearly a role for standards to play to bring much of this context together; AKT is playing a significant role in these efforts. But standards take time to emerge, they take political power to enforce, and they have been known to stifle innovation (in the short term). AKT is keen to understand the balance between principled inference and statistical processing of web content. Logical inference on the Web is tough. Complex queries using traditional AI inference methods bring most distributed computer systems to their knees. Do we set up semantically well-behaved areas of the Web? Is any part of the Web in which semantic hygiene prevails interesting enough to reason in? These and many other questions need to be addressed if we are to provide effective knowledge technologies for our content on the web.

Southampton (e-Prints Soton)

Edinburgh Research Archive

Personalizing Access to Learning Networks

Author: Dolog Peter
Klobucar Tomaz
Nejdl Wolfgang
Simon Bernd
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2008

VBN

Web structure mining of dynamic pages

Author: Naeem MA
Publication venue: Faculty of Computer & Emerging Sciences, Balochistan University of Information Technology and Management Sciences, Quetta
Publication date: 01/01/2006

Web structure mining restricted to static web content decreases the accuracy of mined outcomes and affects the quality of decision-making. By applying structure mining to hidden web data, the accuracy of mined outcomes can be improved, thus enhancing the reliability and quality of decision-making. Data mining is the automated or semi-automated exploration and analysis of large volumes of data in order to reveal meaningful patterns. Web mining is the discovery and analysis of useful information from the World Wide Web; it helps web search engines find high-quality web pages and enhances web click-stream analysis. One branch of web mining is web structure mining, whose goal is to generate a structural summary of a Web site and its Web pages. Web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. In recent years, Web link structure mining has been widely used to infer important information about Web pages. However, a major part of the web is hidden. This part, also called the Deep Web or Hidden Web, consists of documents that are dynamic and not accessible by general search engines; most search engine spiders can access only the publicly indexable Web (or the visible Web). Most documents in the hidden Web, including pages hidden behind search forms, specialized databases, and dynamically generated Web pages, are not accessible by general Web mining applications. Modern web pages use dynamic content generation, and user forms collect information from a particular user and store it in a database. The link structure lying behind these forms cannot be accessed by conventional mining procedures. To access these links, user forms are filled in automatically by a rule-based framework that can robustly read a web page containing dynamic content such as ActiveX controls (input boxes, command buttons, combo boxes, etc.). After these controls are read, dummy values are filled into the available fields and the doGet or doPost methods are automatically executed to acquire the link to the next web page. The accuracy of web page hierarchical structures can be improved substantially by including these hidden web pages in the process of Web structure mining. The designed system framework is robust enough to process dynamic Web pages along with static ones.
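The form-filling step described above can be illustrated with a small sketch. It is not the paper's rule-based framework: it assumes plain HTML forms rather than ActiveX controls, maps the doGet and doPost steps onto HTTP GET and POST requests, and only builds the requests without sending them. The names `FormCollector`, `build_requests`, and `DUMMY` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

DUMMY = "test"  # hypothetical dummy value used for every empty field


class FormCollector(HTMLParser):
    """Collect each form's action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            kind = (attrs.get("type") or "text").lower()
            # Skip buttons; keep a preset value if the page supplies one.
            if name and kind not in ("submit", "button", "reset"):
                self._current["fields"][name] = attrs.get("value") or DUMMY

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def build_requests(page_url, html):
    """Return (method, url, body) tuples that would submit each form."""
    collector = FormCollector()
    collector.feed(html)
    requests = []
    for form in collector.forms:
        target = urljoin(page_url, form["action"])
        data = urlencode(form["fields"])
        if form["method"] == "post":
            requests.append(("post", target, data))
        else:
            # For GET, the filled-in fields travel in the query string.
            url = target + "?" + data if data else target
            requests.append(("get", url, None))
    return requests
```

A crawler could pass each returned tuple to its HTTP client and add the resulting pages to the link graph; that part is left out here so the sketch stays free of network access.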

AUT Scholarly Commons

Web Usage Mining: Algorithms and results

Author: LIM Ee Peng
NG Wee-Keong
Woon Yew-Kwong
Publication venue: Idea Group
Publication date: 01/01/2004

Institutional Knowledge at Singapore Management University

An integrated mobile content recommendation system

Author: Paireekreng Worapat
Publication date: 01/01/2012

Many features have been added to mobile devices to assist the user's information consumption. However, there are limitations due to information overload on the devices, hardware usability and capacity. As a result, content filtering in a mobile recommendation system plays a vital role in the solution to this problem. A system that utilises content filtering can recommend content which matches a user's needs based on user preferences with a higher accuracy rate. However, mobile content recommendation systems have problems and limitations related to cold start and sparsity. These problems can be viewed as the first-time connection and first content rating problems for non-interactive recommendation systems, where information is insufficient to predict mobile content that will match a user's needs. In addition, how to find relevant items for the content recommendation system which are related to a user's profile is also a concern. An integrated model that combines user group identification and mobile content filtering for mobile content recommendation was proposed in this study in order to address the current limitations of mobile content recommendation systems. The model enhances the system by finding the relevant content items that match a user's needs based on the user's profile. A prototype of the client-side user profile modelling is also developed to demonstrate the concept. The integrated model applies clustering techniques to determine groups of users. The content filtering uses classification techniques to predict the top content items. After that, an adaptive association rules technique is applied to find relevant content items. Together, these approaches form the integrated model. Experimental results have demonstrated that the proposed integrated model performs better than comparable techniques such as association rules and collaborative filtering, which have been used in several recommendation systems. The integrated model performed better at finding relevant content items, achieving a higher accuracy rate of content prediction and of successfully recommended relevant content as measured by recommendation metrics. The model also performed better in terms of rule generation and content recommendation generation. Verification of the proposed model was based on real-world practical data. A prototype mobile content recommendation system with a client-side user profile has been developed to handle the revisiting user issue. In addition, context information, such as time-of-day and time-of-week, could also be used to enhance the system by recommending related content to users during different time periods. Finally, it was shown that the proposed method used fewer rules to generate recommendations for mobile content users and took less processing time. This seems to overcome the problems of first-time connection and first content rating for non-interactive recommendation systems.
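To make the association-rule step more concrete, here is a minimal sketch of plain support and confidence mining over pairs of content items. It is not the adaptive variant the abstract refers to, whose details are not given here, and it leaves out the clustering and classification stages. The function name `pair_rules`, its thresholds, and the session format (a list of item lists) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Mine rules "x -> y" between pairs of items.

    transactions   -- list of item lists, e.g. content viewed per session
    min_support    -- minimum fraction of transactions containing both items
    min_confidence -- minimum fraction of x's transactions that also hold y
    Returns a sorted list of (x, y, support, confidence) tuples.
    """
    n = len(transactions)
    if n == 0:
        return []

    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore repeats within a session
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)
```

In a pipeline like the one described, such rules would be mined separately for each user group found by clustering, so that a user who views item `x` can be offered item `y`.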

Research Repository