5 research outputs found

    Web Page Segmentation Algorithms

    Web page segmentation is one of the disciplines of information extraction. It divides a page into distinct semantic blocks. This thesis introduces segmentation as such and implements a specific segmentation method. It describes several example methods, such as VIPS and DOM PS. It gives a theoretical description of the chosen method and of the FitLayout framework, which is to be extended with this method. The implementation of the chosen method is then described in more detail, focusing mainly on the problems that had to be solved. Testing, which helped reveal some shortcomings, is also covered. The conclusion summarizes the results and suggests possible ideas for extending this work.

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.