7 research outputs found

    Searching dynamic Web pages with semi-structured contents

    At present, information systems (IS) in higher education are usually supported by databases (DB) and accessed through a Web interface. Such is the case with SiFEUP, the IS of the Engineering Faculty of the University of Porto (FEUP). The typical SiFEUP user sees the system as a collection of Web pages and is not aware that most of them do not exist as actual HTML files stored on a server; they correspond to HTML code generated on the fly by a program that accesses the DB and brings the most up-to-date information to the user's desktop. Typical search engines do not index dynamically generated Web pages, or index only those specifically mentioned in a static page, without following the links a dynamic page may contain. In this paper we describe the development of a search facility for SiFEUP, how the obstacles to indexing dynamic Web pages were circumvented, and an evaluation of the results obtained. The solution involves a locally developed crawler, the Oracle Text full-text indexer, and meta-information either automatically drawn from the DB or manually added to improve the relevance factor calculation.
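
    A minimal sketch, in Python, of the kind of site-local crawler the abstract describes: dynamic URLs are fetched like any other page and the links inside the generated HTML are harvested and followed, which is precisely the step general-purpose engines skipped. The seed URL, breadth-first queue, and page limit are illustrative assumptions, not SiFEUP's actual implementation.

        # Hypothetical sketch of a site-local crawler for dynamically generated
        # pages; not SiFEUP's code. Dynamic URLs are fetched like static ones and
        # the links inside the generated HTML are followed within the same host.
        from collections import deque
        from html.parser import HTMLParser
        from urllib.parse import urljoin, urlparse
        from urllib.request import urlopen

        class LinkParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.links = []

            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def crawl(seed, max_pages=100):
            host = urlparse(seed).netloc
            seen, queue, pages = {seed}, deque([seed]), {}
            while queue and len(pages) < max_pages:
                url = queue.popleft()
                try:
                    html = urlopen(url).read().decode("utf-8", errors="replace")
                except OSError:
                    continue
                pages[url] = html  # hand the page text to the indexer here
                parser = LinkParser()
                parser.feed(html)
                for href in parser.links:
                    absolute = urljoin(url, href)
                    if urlparse(absolute).netloc == host and absolute not in seen:
                        seen.add(absolute)
                        queue.append(absolute)
            return pages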

    Searching a database based web site

    Currently, information systems are usually supported by databases (DB) and accessed through a Web interface. Pages in such Web sites are not drawn from HTML files but are generated on the fly upon request. Indexing and searching such dynamic pages raises several difficulties that most search engines, designed for static content, do not solve. In this paper we describe the development of a search engine that overcomes most of these problems for a specific Web site, how the obstacles to indexing dynamic Web pages were circumvented, and an evaluation of the results obtained. The solution involves a locally developed crawler, the Oracle Text full-text indexer, and meta-information either automatically drawn from the DB or manually added to improve the relevance factor calculation. It has the advantage of uniformly covering both the dynamic and the static Web pages of the site.
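
    On the indexing side, a hedged illustration of querying an Oracle Text index with a relevance score from Python. The schema (a pages table with body and meta_weight columns) and the multiplicative metadata boost are invented for the example and are not the paper's actual relevance-factor calculation; CONTAINS and SCORE, however, are standard Oracle Text operators.

        # Illustrative only: the table and column names are assumptions, not the
        # paper's schema. The meta_weight multiplier stands in for the abstract's
        # meta-information adjustment to the relevance factor.
        import oracledb  # assumes the python-oracledb driver is installed

        def search(connection, term, limit=20):
            sql = """
                SELECT url, SCORE(1) * meta_weight AS relevance
                FROM pages
                WHERE CONTAINS(body, :term, 1) > 0
                ORDER BY relevance DESC
                FETCH FIRST :n ROWS ONLY
            """
            with connection.cursor() as cur:
                cur.execute(sql, term=term, n=limit)
                return cur.fetchall()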

    CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web

    The Web is the most used information source in academic, scientific, and industrial forums. Its explosive growth has generated billions of pages, which may be categorized as the surface web, composed of static pages that can be indexed, and the hidden web, accessible only through search templates. This paper presents the development of a crawler that allows searching, querying, and analysis of information in both the surface web and the hidden web within specific domains.
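
    As a rough picture of the surface/hidden split the abstract draws, the sketch below separates a page's plain links (surface-web navigation) from its query forms (hidden-web entry points). The heuristic of "a form with at least one text input" is an assumption made for illustration, not CrawNet's actual logic.

        # Hypothetical heuristic, not CrawNet's implementation: ordinary anchors
        # are surface-web navigation, while forms containing a text input are
        # treated as hidden-web entry points (search templates).
        from html.parser import HTMLParser

        class EntryPointParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.links, self.forms = [], []
                self._form = None

            def handle_starttag(self, tag, attrs):
                a = dict(attrs)
                if tag == "a" and a.get("href"):
                    self.links.append(a["href"])
                elif tag == "form":
                    self._form = {"action": a.get("action", ""), "text_inputs": 0}
                elif tag == "input" and self._form is not None:
                    if a.get("type", "text") == "text":
                        self._form["text_inputs"] += 1

            def handle_endtag(self, tag):
                if tag == "form" and self._form is not None:
                    if self._form["text_inputs"] > 0:  # looks like a search template
                        self.forms.append(self._form["action"])
                    self._form = None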

    Sampling, information extraction and summarisation of Hidden Web databases

    Hidden Web databases maintain a collection of specialised documents, which are dynamically generated using page templates. This paper presents the Two-Phase Sampling (2PS) technique, which detects and extracts query-related information from documents contained in such databases. 2PS is based on a two-phase framework for the sampling, information extraction and summarisation of Hidden Web documents. In the first phase, 2PS samples and stores documents for further analysis. In the second phase, it detects Web page templates in the sampled documents and extracts the relevant information from which a content summary is then generated. Experimental results demonstrate that 2PS effectively eliminates irrelevant information from sampled documents and generates terms and frequencies with improved accuracy.
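
    A toy rendering of the second phase's idea, under the assumption that template text is whatever recurs across most sampled documents; 2PS's actual template detection is more sophisticated than this line-frequency filter, and the 90% threshold here is arbitrary.

        # Toy version of the template-stripping idea, not 2PS itself: lines that
        # appear in at least 90% of the sampled documents are treated as page
        # template and removed before term frequencies are counted.
        from collections import Counter

        def strip_templates(documents, threshold=0.9):
            line_df = Counter()
            for doc in documents:
                line_df.update(set(doc.splitlines()))
            cutoff = threshold * len(documents)
            template = {line for line, df in line_df.items() if df >= cutoff}
            cleaned = ["\n".join(l for l in d.splitlines() if l not in template)
                       for d in documents]
            terms = Counter(w for d in cleaned for w in d.lower().split())
            return cleaned, terms  # the content summary: terms and frequencies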

    Utilizing Output in Web Application Server-Side Testing

    This thesis investigates the utilization of web application output in enhancing automated server-side code testing. The server-side code is the main driving force of a web application: it generates the client-side code, maintains the state, and communicates with back-end resources. The output observed in those elements provides a valuable resource that can potentially enhance the efficiency and effectiveness of automated testing. The thesis explores the use of this output in test data generation, test sequence regeneration, augmentation, and test case selection. It also addresses the web-specific challenges faced when applying search based test data generation algorithms to web applications, and applies dataflow analysis of state variables to test sequence regeneration. The thesis presents three tools and four empirical studies to implement and evaluate the proposed approaches: SWAT (Search based Web Application Tester), a first application of search based test data generation algorithms to web applications, which uses values dynamically mined from the intermediate and client-side output to enhance the search based algorithm; SART (State Aware Regeneration Tool), which uses dataflow analysis of state variables, session state and database tables, and their values to regenerate new sequences from existing sequences; and SWAT-U (SWAT-Uniqueness), which augments test suites with test cases that produce outputs not observed in the original test suite's output. Finally, the thesis presents an empirical study of the correlation between new output based test selection criteria and both fault detection and structural coverage. The results confirm that using the output does indeed enhance the effectiveness and efficiency of search based test data generation, and enhances test suites' effectiveness for test sequence regeneration and augmentation. The results also show that output uniqueness criteria are strongly correlated with both fault detection and structural coverage, and are complementary to structural coverage.
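
    The output-uniqueness criterion can be pictured with a small greedy selector: a test case is kept only if its observed output has not been produced by an already selected case. This is a schematic reading of the criterion, not SWAT-U's code; run is a stand-in for executing a test and capturing its (suitably normalised) output.

        # Schematic greedy selection by output uniqueness; not SWAT-U itself.
        def select_by_output_uniqueness(test_cases, run):
            """run(test_case) -> hashable output, e.g. normalised HTML."""
            selected, seen_outputs = [], set()
            for case in test_cases:
                output = run(case)
                if output not in seen_outputs:  # new output earns the case a place
                    seen_outputs.add(output)
                    selected.append(case)
            return selected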

    On the Automatic Extraction of Data from the Hidden Web

    An increasing amount of Web data is accessible only by filling out HTML forms to query an underlying data source. While this is most welcome from a user perspective (queries are easy and precise) and from a data management perspective (static pages need not be maintained; databases can be accessed directly), automated agents have greater difficulty accessing data behind forms. In this paper we present a method for automatically filling in forms to retrieve the associated dynamically generated pages. Using our approach, automated agents can begin to systematically access portions of the "hidden Web."
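
    A minimal sketch of the automatic form-filling step the abstract describes, assuming a GET form whose empty text fields can be filled with a sample keyword. Real hidden-web agents need far smarter value selection, and FormParser and submit_first_form are names invented for this example.

        # Hypothetical, simplified form filler: parse the first form on a page,
        # keep each field's default value, drop a sample keyword into empty
        # fields, and submit as a GET request. Real agents must also handle
        # POST, select menus, and informed value inference.
        from html.parser import HTMLParser
        from urllib.parse import urlencode, urljoin
        from urllib.request import urlopen

        class FormParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.action, self.fields = None, {}
                self._in_form = False

            def handle_starttag(self, tag, attrs):
                a = dict(attrs)
                if tag == "form" and self.action is None:
                    self.action, self._in_form = a.get("action", ""), True
                elif tag == "input" and self._in_form and a.get("name"):
                    self.fields[a["name"]] = a.get("value", "")

            def handle_endtag(self, tag):
                if tag == "form":
                    self._in_form = False

        def submit_first_form(page_url, keyword="test"):
            parser = FormParser()
            parser.feed(urlopen(page_url).read().decode("utf-8", errors="replace"))
            data = {k: v or keyword for k, v in parser.fields.items()}
            target = urljoin(page_url, parser.action or page_url)
            return urlopen(target + "?" + urlencode(data)).read()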