    Automated retrieval and extraction of training course information from unstructured web pages

    Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages, and there is a pressing need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. Some commercial, automated web extraction software packages exist; however, their success depends on heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages, and manually dealing with the evaluation and storage of the extracted results. This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; after much deliberation, Naïve Bayes Networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) to generate web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web, such as course names, prices, dates and locations. The experimental results indicate that all three aspects of this research perform very well: the Web Crawler outperforms existing crawling systems, the Web Classifier achieves an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieves an accuracy of over 94% for the extraction of course titles and an accuracy of just under 67% for the extraction of other course attributes such as dates, prices and locations. Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, and often no, human assistance.
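
    As a rough illustration of the evolved-regex idea, the following minimal Python sketch evolves a regular expression that extracts course dates. The token set, toy training data, fitness measure and mutation scheme are all invented for the example and are not the thesis's actual design.

```python
# A minimal sketch of evolving regular expressions with a genetic algorithm,
# loosely following the idea of GP-generated extraction rules. All data and
# design choices below are illustrative assumptions.
import random
import re

# Building blocks a candidate regex can be assembled from (assumed).
TOKENS = [r"\d{1,2}", r"\d{4}", r"[A-Z][a-z]+", r"\s", r"[/-]", r",?\s*"]

# Toy training set: page text paired with the date span we want extracted.
TRAIN = [
    ("Course starts 12 March 2024 in London", "12 March 2024"),
    ("Next intake: 3 June 2025, online", "3 June 2025"),
]

def fitness(tokens):
    """Count training examples where the regex matches exactly the target."""
    try:
        pattern = re.compile("".join(tokens))
    except re.error:
        return 0
    score = 0
    for text, target in TRAIN:
        m = pattern.search(text)
        if m and m.group(0) == target:
            score += 1
    return score

def mutate(tokens):
    out = list(tokens)
    op = random.choice(("add", "drop", "swap"))
    if op == "add" or len(out) < 2:
        out.insert(random.randrange(len(out) + 1), random.choice(TOKENS))
    elif op == "drop":
        out.pop(random.randrange(len(out)))
    else:
        out[random.randrange(len(out))] = random.choice(TOKENS)
    return out

def evolve(generations=200, pop_size=30):
    pop = [[random.choice(TOKENS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    best = max(pop, key=fitness)
    return "".join(best), fitness(best)

if __name__ == "__main__":
    pattern, score = evolve()
    print(f"best pattern: {pattern!r}  (matches {score}/{len(TRAIN)})")
```

    On this toy data a sequence such as \d{1,2}\s[A-Z][a-z]+\s\d{4} is reachable from the token set, so the search typically converges within a few dozen generations.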

    Application of evolutionary computation techniques for the identification of innovators in open innovation communities

    Open innovation represents an emergent paradigm by which organizations make use of internal and external resources to drive their innovation processes. The growth of information and communication technologies has facilitated direct contact with customers and users, who can be organized as open innovation communities through the Internet. The main drawback of this scheme is the huge amount of information generated by users, which can negatively affect the correct identification of potentially applicable ideas. This paper proposes the use of evolutionary computation techniques for the identification of innovators, that is, those users with the ability to generate attractive and applicable ideas for the organization. For this purpose, several characteristics related to the participation activity of users through open innovation communities have been collected and combined in the form of discriminant functions to maximize their correct classification. The correct classification of innovators can be used to improve the idea evaluation process carried out by the organization's innovation team. In addition, the obtained results can also be used to test lead user theory and to measure to what extent lead users are aligned with the organization's strategic innovation policies.
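
    To make the discriminant-function idea concrete, here is a minimal sketch of evolving the weights of a linear discriminant over user-activity features so that known innovators are separated from other community members. The features (ideas posted, comments made, votes received), the toy data and the GA settings are invented for the example, not taken from the paper.

```python
# Illustrative sketch: evolve linear discriminant weights that separate
# innovators (label 1) from other users (label 0). Data is invented.
import random

# (ideas posted, comments made, votes received) per user -- toy values.
USERS = [
    ((12, 40, 90), 1),   # known innovator
    ((15, 55, 120), 1),  # known innovator
    ((2, 5, 3), 0),
    ((1, 9, 7), 0),
    ((3, 2, 10), 0),
]

def accuracy(weights):
    """Fraction of users that the rule w.x + b > 0 classifies correctly."""
    *w, b = weights
    correct = 0
    for features, label in USERS:
        score = sum(wi * xi for wi, xi in zip(w, features)) + b
        correct += int((score > 0) == bool(label))
    return correct / len(USERS)

def mutate(weights, scale=1.0):
    return [wi + random.gauss(0, scale) for wi in weights]

def evolve(pop_size=40, generations=100):
    # Each individual holds three feature weights plus a bias term.
    pop = [[random.gauss(0, 1) for _ in range(4)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=accuracy, reverse=True)
        elite = pop[: pop_size // 4]
        pop = elite + [mutate(random.choice(elite))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=accuracy)

if __name__ == "__main__":
    best = evolve()
    print("weights:", [round(w, 2) for w in best],
          "accuracy:", accuracy(best))
```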

    Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling

    Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself and on its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to the pages with the most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models, including those that use a far more complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two previously unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modelling. All experiments were carried out on an original dataset, made available by a commercial focused crawler.
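
    For illustration, a minimal sketch of the probabilistic angle: fitting NGBoost with a Poisson outcome distribution to new-outlink counts, using the real ngboost library but entirely synthetic 'look back, look around' features (the feature construction and data below are assumptions, not the paper's dataset).

```python
# Sketch: NGBoost with a Poisson predictive distribution for the number of
# new outlinks per page. Data is synthetic; only the library calls are real.
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Poisson

rng = np.random.default_rng(0)

# Toy features per page: recent new-outlink counts on the page itself
# ("look back") and on its content-related neighbours ("look around").
n = 500
X = rng.poisson(lam=(3.0, 2.0), size=(n, 2)).astype(float)

# Synthetic ground truth: the rate depends mostly on the page's own history.
rate = 0.8 * X[:, 0] + 0.3 * X[:, 1] + 0.5
y = rng.poisson(rate)

# NGBoost learns the parameters of a per-page Poisson distribution.
model = NGBRegressor(Dist=Poisson, n_estimators=300, verbose=False)
model.fit(X, y)

# Point predictions; model.pred_dist(X) would give the full distributions.
print("predicted new-outlink counts:", np.round(model.predict(X[:5]), 2))
```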

    Layout Optimization of Microsatellite Components Using Genetic Algorithm

    The placement of satellite components is usually a non-deterministic polynomial-time hard (NP-hard) problem, which in terms of computational complexity is very difficult to solve. This problem is normally known as the layout optimization problem (LOP). In this study, the layout of microsatellite components has to meet the requirements set by the mission payloads, the launcher and spacecraft attitude control. The novel scheme is to find various possibilities for an optimal layout using genetic algorithms combined with an order-based positioning technique (OPT). Each component is assigned an index and then placed in a container in a specific order of placement, in accordance with the established bottom-left (BL) algorithm. The placement order is generated by the genetic algorithm, which explores various possibilities to obtain a sequence that yields the best solution.
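
    As a sketch of the order-based positioning idea, the following example evolves a placement order with order crossover (OX) and decodes each order into a layout. The component sizes and container width are toy values, and a simplified shelf-style decoder stands in for the full bottom-left algorithm.

```python
# Sketch: a GA evolves the order in which components are placed; a greedy
# shelf decoder (a much simplified stand-in for bottom-left placement)
# turns each order into a layout height to be minimised.
import random

CONTAINER_W = 100
# (width, height) of each component -- toy values.
COMPONENTS = [(40, 20), (30, 25), (25, 15), (50, 10), (20, 20), (35, 30)]

def decode(order):
    """Place components left to right; open a new shelf when one no longer
    fits. Returns the total container height used (lower is better)."""
    x = y = shelf_h = 0
    for i in order:
        w, h = COMPONENTS[i]
        if x + w > CONTAINER_W:   # current shelf full: start a new one
            y += shelf_h
            x = shelf_h = 0
        x += w
        shelf_h = max(shelf_h, h)
    return y + shelf_h

def crossover(a, b):
    """Order crossover (OX): keep a slice of parent a, fill the rest
    with parent b's genes in their original order."""
    i, j = sorted(random.sample(range(len(a)), 2))
    child = [None] * len(a)
    child[i:j] = a[i:j]
    fill = [g for g in b if g not in child]
    for k in range(len(a)):
        if child[k] is None:
            child[k] = fill.pop(0)
    return child

def evolve(pop_size=40, generations=200):
    n = len(COMPONENTS)
    pop = [random.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=decode)                 # ascending: lower height first
        elite = pop[: pop_size // 2]
        pop = elite + [crossover(*random.sample(elite, 2)) for _ in elite]
    return min(pop, key=decode)

if __name__ == "__main__":
    best = evolve()
    print("best order:", best, "height used:", decode(best))
```

    A real BL decoder would track exact (x, y) coordinates and overlap constraints; the shelf rule above only preserves the key property that the evolved permutation fully determines the layout.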

    Recent results in the development of band steaming for intra-row weed control

    Recent achievements in developing band-steaming techniques for intra-row weed control in vegetables are presented.