7 research outputs found

    Searching dynamic Web pages with semi-structured contents

    At present, information systems (IS) in higher education are usually supported by databases (DB) and accessed through a Web interface. Such is the case with SiFEUP, the IS of the Engineering Faculty of the University of Porto (FEUP). The typical SiFEUP user sees the system as a collection of Web pages and is not aware that most of them do not exist as actual HTML files stored on a server; they correspond to HTML code generated on the fly by a program that accesses the DB and brings the most up-to-date information to the user's desktop. Typical search engines do not index dynamically generated Web pages, or index only those specifically mentioned in a static page, without following the links a dynamic page may contain. In this paper we describe the development of a search facility for SiFEUP, how the obstacles to indexing dynamic Web pages were circumvented, and an evaluation of the results obtained. The solution involves a locally developed crawler, the Oracle Text full-text indexer, and meta-information either automatically drawn from the DB or manually added to improve the relevance factor calculation.
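
    A minimal sketch, in Python, of the kind of site-local crawler the abstract describes: dynamic URLs are fetched like any other page and the links inside the generated HTML are harvested and followed, which is precisely the step general-purpose engines skipped. The seed URL, breadth-first queue, and page limit are illustrative assumptions, not SiFEUP's actual implementation.

        # Hypothetical sketch of a site-local crawler for dynamically generated
        # pages; not SiFEUP's code. Dynamic URLs are fetched like static ones and
        # the links inside the generated HTML are followed within the same host.
        from collections import deque
        from html.parser import HTMLParser
        from urllib.parse import urljoin, urlparse
        from urllib.request import urlopen

        class LinkParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.links = []

            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def crawl(seed, max_pages=100):
            host = urlparse(seed).netloc
            seen, queue, pages = {seed}, deque([seed]), {}
            while queue and len(pages) < max_pages:
                url = queue.popleft()
                try:
                    html = urlopen(url).read().decode("utf-8", errors="replace")
                except OSError:
                    continue
                pages[url] = html  # hand the page text to the indexer here
                parser = LinkParser()
                parser.feed(html)
                for href in parser.links:
                    absolute = urljoin(url, href)
                    if urlparse(absolute).netloc == host and absolute not in seen:
                        seen.add(absolute)
                        queue.append(absolute)
            return pages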

    Searching a database based web site

    Currently, information systems are usually supported by databases (DB) and accessed through a Web interface. Pages in such Web sites are not drawn from HTML files but are generated on the fly upon request. Indexing and searching such dynamic pages raises several difficulties that most search engines, designed for static content, do not solve. In this paper we describe the development of a search engine that overcomes most of these problems for a specific Web site, how the obstacles to indexing dynamic Web pages were circumvented, and an evaluation of the results obtained. The solution involves a locally developed crawler, the Oracle Text full-text indexer, and meta-information either automatically drawn from the DB or manually added to improve the relevance factor calculation. It has the advantage of uniformly covering both the dynamic and the static Web pages of the site.
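
    On the indexing side, a hedged illustration of querying an Oracle Text index with a relevance score from Python. The schema (a pages table with body and meta_weight columns) and the multiplicative metadata boost are invented for the example and are not the paper's actual relevance-factor calculation; CONTAINS and SCORE, however, are standard Oracle Text operators.

        # Illustrative only: the table and column names are assumptions, not the
        # paper's schema. The meta_weight multiplier stands in for the abstract's
        # meta-information adjustment to the relevance factor.
        import oracledb  # assumes the python-oracledb driver is installed

        def search(connection, term, limit=20):
            sql = """
                SELECT url, SCORE(1) * meta_weight AS relevance
                FROM pages
                WHERE CONTAINS(body, :term, 1) > 0
                ORDER BY relevance DESC
                FETCH FIRST :n ROWS ONLY
            """
            with connection.cursor() as cur:
                cur.execute(sql, term=term, n=limit)
                return cur.fetchall()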

    CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web

    The Web is the most used information source in academic, scientific, and industrial forums. Its explosive growth has generated billions of pages, which may be categorized as the surface web, composed of static pages that can be indexed, and the hidden web, accessible only through search templates. This paper presents the development of a crawler that allows searching, querying, and analysis of information in both the surface web and the hidden web within specific domains.
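
    As a rough picture of the surface/hidden split the abstract draws, the sketch below separates a page's plain links (surface-web navigation) from its query forms (hidden-web entry points). The heuristic of "a form with at least one text input" is an assumption made for illustration, not CrawNet's actual logic.

        # Hypothetical heuristic, not CrawNet's implementation: ordinary anchors
        # are surface-web navigation, while forms containing a text input are
        # treated as hidden-web entry points (search templates).
        from html.parser import HTMLParser

        class EntryPointParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.links, self.forms = [], []
                self._form = None

            def handle_starttag(self, tag, attrs):
                a = dict(attrs)
                if tag == "a" and a.get("href"):
                    self.links.append(a["href"])
                elif tag == "form":
                    self._form = {"action": a.get("action", ""), "text_inputs": 0}
                elif tag == "input" and self._form is not None:
                    if a.get("type", "text") == "text":
                        self._form["text_inputs"] += 1

            def handle_endtag(self, tag):
                if tag == "form" and self._form is not None:
                    if self._form["text_inputs"] > 0:  # looks like a search template
                        self.forms.append(self._form["action"])
                    self._form = None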

    Sampling, information extraction and summarisation of Hidden Web databases

    Hidden Web databases maintain a collection of specialised documents, which are dynamically generated using page templates. This paper presents the Two-Phase Sampling (2PS) technique, which detects and extracts query-related information from documents contained in such databases. 2PS is based on a two-phase framework for the sampling, information extraction and summarisation of Hidden Web documents. In the first phase, 2PS samples and stores documents for further analysis. In the second phase, it detects Web page templates in the sampled documents and extracts the relevant information from which a content summary is then generated. Experimental results demonstrate that 2PS effectively eliminates irrelevant information from sampled documents and generates terms and frequencies with improved accuracy.
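
    A toy rendering of the second phase's idea, under the assumption that template text is whatever recurs across most sampled documents; 2PS's actual template detection is more sophisticated than this line-frequency filter, and the 90% threshold here is arbitrary.

        # Toy version of the template-stripping idea, not 2PS itself: lines that
        # appear in at least 90% of the sampled documents are treated as page
        # template and removed before term frequencies are counted.
        from collections import Counter

        def strip_templates(documents, threshold=0.9):
            line_df = Counter()
            for doc in documents:
                line_df.update(set(doc.splitlines()))
            cutoff = threshold * len(documents)
            template = {line for line, df in line_df.items() if df >= cutoff}
            cleaned = ["\n".join(l for l in d.splitlines() if l not in template)
                       for d in documents]
            terms = Counter(w for d in cleaned for w in d.lower().split())
            return cleaned, terms  # the content summary: terms and frequencies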

    Utilizing Output in Web Application Server-Side Testing

    This thesis investigates the utilization of web application output in enhancing automated server-side code testing. The server-side code is the main driving force of a web application: it generates the client-side code, maintains the state, and communicates with back-end resources. The output observed in those elements provides a valuable resource that can potentially enhance the efficiency and effectiveness of automated testing. The thesis explores the use of this output in test data generation, test sequence regeneration, augmentation, and test case selection. It also addresses the web-specific challenges faced when applying search based test data generation algorithms to web applications, and applies dataflow analysis of state variables to test sequence regeneration. The thesis presents three tools and four empirical studies to implement and evaluate the proposed approaches: SWAT (Search based Web Application Tester), a first application of search based test data generation algorithms to web applications, which uses values dynamically mined from the intermediate and client-side output to enhance the search based algorithm; SART (State Aware Regeneration Tool), which uses dataflow analysis of state variables, session state and database tables, and their values to regenerate new sequences from existing sequences; and SWAT-U (SWAT-Uniqueness), which augments test suites with test cases that produce outputs not observed in the original test suite's output. Finally, the thesis presents an empirical study of the correlation between new output based test selection criteria and both fault detection and structural coverage. The results confirm that using the output does indeed enhance the effectiveness and efficiency of search based test data generation, and enhances test suites' effectiveness for test sequence regeneration and augmentation. The results also show that output uniqueness criteria are strongly correlated with both fault detection and structural coverage, and are complementary to structural coverage.
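
    The output-uniqueness criterion can be pictured with a small greedy selector: a test case is kept only if its observed output has not been produced by an already selected case. This is a schematic reading of the criterion, not SWAT-U's code; run is a stand-in for executing a test and capturing its (suitably normalised) output.

        # Schematic greedy selection by output uniqueness; not SWAT-U itself.
        def select_by_output_uniqueness(test_cases, run):
            """run(test_case) -> hashable output, e.g. normalised HTML."""
            selected, seen_outputs = [], set()
            for case in test_cases:
                output = run(case)
                if output not in seen_outputs:  # new output earns the case a place
                    seen_outputs.add(output)
                    selected.append(case)
            return selected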

    On the Automatic Extraction of Data from the Hidden Web

    An increasing amount of Web data is accessible only by filling out HTML forms to query an underlying data source. While this is most welcome from a user perspective (queries are easy and precise) and from a data management perspective (static pages need not be maintained; databases can be accessed directly), automated agents have greater difficulty accessing data behind forms. In this paper we present a method for automatically filling in forms to retrieve the associated dynamically generated pages. Using our approach, automated agents can begin to systematically access portions of the "hidden Web."
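
    A minimal sketch of the automatic form-filling step the abstract describes, assuming a GET form whose empty text fields can be filled with a sample keyword. Real hidden-web agents need far smarter value selection, and FormParser and submit_first_form are names invented for this example.

        # Hypothetical, simplified form filler: parse the first form on a page,
        # keep each field's default value, drop a sample keyword into empty
        # fields, and submit as a GET request. Real agents must also handle
        # POST, select menus, and informed value inference.
        from html.parser import HTMLParser
        from urllib.parse import urlencode, urljoin
        from urllib.request import urlopen

        class FormParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.action, self.fields = None, {}
                self._in_form = False

            def handle_starttag(self, tag, attrs):
                a = dict(attrs)
                if tag == "form" and self.action is None:
                    self.action, self._in_form = a.get("action", ""), True
                elif tag == "input" and self._in_form and a.get("name"):
                    self.fields[a["name"]] = a.get("value", "")

            def handle_endtag(self, tag):
                if tag == "form":
                    self._in_form = False

        def submit_first_form(page_url, keyword="test"):
            parser = FormParser()
            parser.feed(urlopen(page_url).read().decode("utf-8", errors="replace"))
            data = {k: v or keyword for k, v in parser.fields.items()}
            target = urljoin(page_url, parser.action or page_url)
            return urlopen(target + "?" + urlencode(data)).read()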