Web Page Segmentation Algorithms
Web page segmentation is one of the disciplines of information extraction. It divides a page into distinct semantic blocks. This thesis introduces segmentation as such and implements a specific segmentation method. It describes several example methods, such as VIPS and DOM PS. It gives a theoretical description of the chosen method and of the FITLayout framework, which is to be extended with this method. The implementation of the chosen method is then described in more detail, focusing mainly on the problems that had to be solved. The thesis also covers the testing that helped reveal some weaknesses. The conclusion summarizes the results and suggests possible ideas for extending this work.
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that uses the RSS feeds and HTML representations of blogs. It outlines the possibilities for extracting the semantics available in blogs and demonstrates the benefits of exploiting existing standards such as microformats and microdata. The report then proposes a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.