20,033 research outputs found

    What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors

    Full text link
    "© ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in PUBLICATION, {VOL 13, ISS 2, (APR 2019)} http://doi.acm.org/10.1145/3316810"[EN] A Web template is a resource that implements the structure and format of a website, making it ready for plugging content into already formatted and prepared pages. For this reason, templates are one of the main development resources for website engineers, because they increase productivity. Templates are also useful for the final user, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information, such as advertisements, menus, and banners. Processing and storing this information leads to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. There exist many techniques and tools for template extraction, but, unfortunately, it is not clear at all which template extractor should a user/system use, because they have never been compared, and because they present different (complementary) features such as precision, recall, and efficiency. In this work, we compare the most advanced template extractors. We implemented and evaluated five of the most advanced template extractors in the literature. To compare all of them, we implemented a workbench, where they have been integrated and evaluated. Thanks to this workbench, we can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Ciencia, Innovacion y Universidades/AEI under grant TIN2016-76843-C4-1-R and by the Generalitat Valenciana under grants PROMETEO-II/2015/013 (SmartLogic) and Prometeo/2019/098 (DeepTrust).Alarte, J.; Silva, J.; Tamarit Muñoz, S. (2019). What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors. ACM Transactions on the Web. 13(2):9:1-9:19. https://doi.org/10.1145/3316810S9:19:19132Alarte, J., Insa, D., Silva, J., & Tamarit, S. (2015). TeMex. Proceedings of the 24th International Conference on World Wide Web - WWW ’15 Companion. doi:10.1145/2740908.2742835Julián Alarte David Insa Josep Silva and Salvador Tamarit. 2016. Site-Level Web Template Extraction Based on DOM Analysis. Springer International Publishing Cham 36--49. Julián Alarte David Insa Josep Silva and Salvador Tamarit. 2016. Site-Level Web Template Extraction Based on DOM Analysis. Springer International Publishing Cham 36--49.Alassi, D., & Alhajj, R. (2013). Effectiveness of template detection on noise reduction and websites summarization. Information Sciences, 219, 41-72. doi:10.1016/j.ins.2012.07.022Bar-Yossef, Z., & Rajagopalan, S. (2002). Template detection via data mining and its applications. Proceedings of the eleventh international conference on World Wide Web - WWW ’02. doi:10.1145/511446.511522Chakrabarti, D., Kumar, R., & Punera, K. (2007). Page-level template detection via isotonic smoothing. Proceedings of the 16th international conference on World Wide Web - WWW ’07. doi:10.1145/1242572.1242582Chen, L., Ye, S., & Li, X. (2006). Template detection for large scale search engines. Proceedings of the 2006 ACM symposium on Applied computing - SAC ’06. doi:10.1145/1141277.1141534Gibson, D., Punera, K., & Tomkins, A. (2005). The volume and evolution of web page templates. Special interest tracks and posters of the 14th international conference on World Wide Web - WWW ’05. doi:10.1145/1062745.1062763Kim, C., & Shim, K. (2011). TEXT: Automatic Template Extraction from Heterogeneous Web Pages. IEEE Transactions on Knowledge and Data Engineering, 23(4), 612-626. doi:10.1109/tkde.2010.140Barbara Ann Kitchenham David Budgen and Pearl Brereton. 2015. Evidence-Based Software Engineering and Systematic Reviews. Chapman 8 Hall/CRC. Barbara Ann Kitchenham David Budgen and Pearl Brereton. 2015. Evidence-Based Software Engineering and Systematic Reviews. Chapman 8 Hall/CRC.Kołcz, A., & Yih, W. (s. f.). Site-Independent Template-Block Detection. Lecture Notes in Computer Science, 152-163. doi:10.1007/978-3-540-74976-9_17Kohlschütter, C. (2009). A densitometric analysis of web template content. Proceedings of the 18th international conference on World wide web - WWW ’09. doi:10.1145/1526709.1526909Jing Li and C. I. Ezeife. 2006. Cleaning web pages for effective web content mining. In Database and Expert Systems Applications Stéphane Bressan Josef Küng and Roland Wagner (Eds.). Springer Berlin 560--571. 10.1007/11827405_55 Jing Li and C. I. Ezeife. 2006. Cleaning web pages for effective web content mining. In Database and Expert Systems Applications Stéphane Bressan Josef Küng and Roland Wagner (Eds.). Springer Berlin 560--571. 10.1007/11827405_55Bing Liu. 2006. Web Data Mining: Exploring Hyperlinks Contents and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York Inc. Secaucus NJ. Bing Liu. 2006. Web Data Mining: Exploring Hyperlinks Contents and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York Inc. Secaucus NJ.Liu, L., Han, W., Buttler, D., Pu, C., & Tang, W. (1999). An XJML-based wrapper generator for Web information extraction. Proceedings of the 1999 ACM SIGMOD international conference on Management of data - SIGMOD ’99. doi:10.1145/304182.304570Ma, L., Goharian, N., Chowdhury, A., & Chung, M. (2003). Extracting unstructured data from template generated web documents. Proceedings of the twelfth international conference on Information and knowledge management - CIKM ’03. doi:10.1145/956863.956961Manjula, R., & Chilambuchelvan, A. (2013). Extracting templates from Web pages. 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE). doi:10.1109/icgce.2013.6823541Christopher D. Manning Prabhakar Raghavan and Hinrich SchÃijtze. 2008. Introduction to Information Retrieval. Cambridge University Press New York NY. Christopher D. Manning Prabhakar Raghavan and Hinrich SchÃijtze. 2008. Introduction to Information Retrieval. Cambridge University Press New York NY.Meng, X., Hu, D., & Li, C. (2003). Schema-guided wrapper maintenance for web-data extraction. Proceedings of the fifth ACM international workshop on Web information and data management - WIDM ’03. doi:10.1145/956699.956701Nguyen, D. Q., Nguyen, D. Q., Pham, S. B., & Bui, T. D. (2009). A Fast Template-Based Approach to Automatically Identify Primary Text Content of a Web Page. 2009 International Conference on Knowledge and Systems Engineering. doi:10.1109/kse.2009.39Schäfer, R. (2016). Accurate and efficient general-purpose boilerplate detection for crawled web corpora. Language Resources and Evaluation, 51(3), 873-889. doi:10.1007/s10579-016-9359-2Sivakumar, P. (2015). Effectual Web Content Mining using Noise Removal from Web Pages. Wireless Personal Communications, 84(1), 99-121. doi:10.1007/s11277-015-2596-7Song, D., Sun, F., & Liao, L. (2013). A hybrid approach for content extraction with text density and visual importance of DOM nodes. Knowledge and Information Systems, 42(1), 75-96. doi:10.1007/s10115-013-0687-xR. Uma and B. Latha. 2018. Noise elimination from web pages for efficacious information retrieval. Cluster Comput. (Mar. 2018). https://link.springer.com/article/10.1007/s10586-018-2366-x#citeas. R. Uma and B. Latha. 2018. Noise elimination from web pages for efficacious information retrieval. Cluster Comput. (Mar. 2018). https://link.springer.com/article/10.1007/s10586-018-2366-x#citeas.Uzun, E., Agun, H. V., & Yerlikaya, T. (2013). A hybrid approach for extracting informative content from web pages. Information Processing & Management, 49(4), 928-944. doi:10.1016/j.ipm.2013.02.005Vieira, K., da Costa Carvalho, A. L., Berlt, K., de Moura, E. S., da Silva, A. S., & Freire, J. (2009). On Finding Templates on Web Collections. World Wide Web, 12(2), 171-211. doi:10.1007/s11280-009-0059-3Vieira, K., da Silva, A. S., Pinto, N., de Moura, E. S., Cavalcanti, J. M. B., & Freire, J. (2006). A fast and robust method for web page template detection and removal. Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM ’06. doi:10.1145/1183614.1183654Thijs Vogels Octavian-Eugen Ganea and Carsten Eickhoff. 2018. Web2Text: Deep structured boilerplate removal. CoRR abs/1801.02607 (2018). Retrieved from http://arxiv.org/abs/1801.02607. Thijs Vogels Octavian-Eugen Ganea and Carsten Eickhoff. 2018. Web2Text: Deep structured boilerplate removal. CoRR abs/1801.02607 (2018). Retrieved from http://arxiv.org/abs/1801.02607.Wang, Y., Fang, B., Cheng, X., Guo, L., & Xu, H. (2008). Incremental web page template detection. Proceeding of the 17th international conference on World Wide Web - WWW ’08. doi:10.1145/1367497.1367749Yi, L., Liu, B., & Li, X. (2003). Eliminating noisy information in Web pages for data mining. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’03. doi:10.1145/956750.956785Zheng, S., Song, R., Wen, J.-R., & Giles, C. L. (2009). Efficient record-level wrapper induction. Proceeding of the 18th ACM conference on Information and knowledge management - CIKM ’09. doi:10.1145/1645953.1645962Zheng, S., Song, R., Wen, J.-R., & Wu, D. (2007). Joint optimization of wrapper generation and template detection. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’07. doi:10.1145/1281192.128128

    TeMex: The Web Template Extractor

    Full text link
    "© ACM} 2015. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM, In Proceedings of the 24th International Conference on World Wide Web (pp. 155-158), http://dx.doi.org/10.1145/2740908.2742835This paper presents and describes TeMex, a site-level web template extractor. TeMex is fully automatic, and it can work with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed) and, more importantly, it does not need a prede- fined set of webpages to perform the analysis. TeMex only needs a URL. Contrarily to previous approaches, it includes a mechanism to identify webpage candidates that share the same template. This mechanism increases both recall and precision, and it also reduces the amount of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation.This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad (Secretaría de Estado de Investigación, Desarrollo e Innovación) under Grant TIN2013-44742-C4-1-R and by the Generalitat Valenciana under Grant PROMETEOII/2015/013. David Insa was partially supported by the Spanish Ministerio de Educación under FPU Grant AP2010-4415. Salvador Tamarit was partially supported by research project POLCA, Programming Large Scale Heterogeneous Infrastructures (610686), funded by the European Union, STREP FP7.Alarte, J.; Insa Cabrera, D.; Silva Galiana, JF.; Tamarit Muñoz, S. (2015). TeMex: The Web Template Extractor. ACM. https://doi.org/10.1145/2740908.2742835SOverlay extension. Available from URL: https://developer.mozilla.org/en-US/Add-ons/Overlay_Extensions, 2005.J. Alarte, D. Insa, J. Silva, and S. Tamarit. Automatic Detection of Webpages that Share the Same Web Template. In M. H. ter Beek and A. Ravara, editors, Proceedings of the 10th International Workshop on Automated Specification and Verification of Web Systems (WWV 14), volume 163 of Electronic Proceedings in Theoretical Computer Science, pages 2--15. Open Publishing Association, July 2014.J. Alarte, D. Insa, J. Silva, and S. Tamarit. A Benchmark Suite for Template Detection and Content Extraction. CoRR, abs/1409.6182, 2014.Z. Bar-Yossef and S. Rajagopalan. Template detection via data mining and its applications. In Proceedings of the 11th International Conference on World Wide Web (WWW'02), pages 580--591, New York, NY, USA, 2002. ACM.M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff. Cleaneval: a Competition for Cleaning Web Pages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC'08), pages 638--643. European Language Resources Association, may 2008.D. Gibson, K. Punera, and A. Tomkins. The volume and evolution of web page templates. In A. Ellis and T. Hagino, editors, Proceedings of the 14th International Conference on World Wide Web (WWW'05), pages 830--839. ACM, may 2005.T. Gottron. Evaluating content extraction on HTML documents. In V. Grout, D. Oram, and R. Picking, editors, Proceedings of the 2nd International Conference on Internet Technologies and Applications (ITA'07), pages 123--132. National Assembly for Wales, sep 2007.D. d. C. Reis, P. B. Golgher, A. S. Silva, and A. H. F. Laender. Automatic web news extraction using tree edit distance. In Proceedings of the 13th International Conference on World Wide Web (WWW'04), pages 502--511, New York, NY, USA, 2004. ACM.K. Vieira, A. L. da Costa Carvalho, K. Berlt, E. S. de Moura, A. S. da Silva, and J. Freire. On finding templates on web collections. World Wide Web, 12(2):171--211, 2009.K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. a. M. B. Cavalcanti, and J. Freire. A fast and robust method for web page template detection and removal. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06), pages 258--267, New York, NY, USA, 2006. ACM.T. Weninger, W. Henry Hsu, and J. Han. CETR: Content Extraction via Tag Ratios. In M. Rappa, P. Jones, J. Freire, and S. Chakrabarti, editors, Proceedings of the 19th International Conference on World Wide Web (WWW'10), pages 971--980. ACM, apr 2010.L. Yi, B. Liu, and X. Li. Eliminating noisy information in web pages for data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data mining (KDD'03), pages 296--305, New York, NY, USA, 2003. ACM

    Coping with noise in a real-world weblog crawler and retrieval system

    Get PDF
    In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to noise removal from blog pages, examining the difficulties encountered when crawling the blogosphere during the creation of a real-world corpus of blog pages. We introduce and evaluate a number of enhancements to the original DiffPost approach in order to increase the robustness of the algorithm. We then extend DiffPost by looking at the anchor-text to text ratio, and dis- cover that the time-interval between crawls is more impor- tant to the successful application of noise-removal algorithms within the blog context, than any additional improvements to the removal algorithm itself

    Vision-Based Road Detection in Automotive Systems: A Real-Time Expectation-Driven Approach

    Full text link
    The main aim of this work is the development of a vision-based road detection system fast enough to cope with the difficult real-time constraints imposed by moving vehicle applications. The hardware platform, a special-purpose massively parallel system, has been chosen to minimize system production and operational costs. This paper presents a novel approach to expectation-driven low-level image segmentation, which can be mapped naturally onto mesh-connected massively parallel SIMD architectures capable of handling hierarchical data structures. The input image is assumed to contain a distorted version of a given template; a multiresolution stretching process is used to reshape the original template in accordance with the acquired image content, minimizing a potential function. The distorted template is the process output.Comment: See http://www.jair.org/ for any accompanying file

    Multi-resolution internal template cleaning: An application to the Wilkinson Microwave Anisotropy Probe 7-yr polarization data

    Get PDF
    Cosmic microwave background (CMB) radiation data obtained by different experiments contain, besides the desired signal, a superposition of microwave sky contributions. We present a fast and robust method, using a wavelet decomposition on the sphere, to recover the CMB signal from microwave maps. An application to \textit{WMAP} polarization data is presented, showing its good performance particularly in very polluted regions of the sky. The applied wavelet has the advantages of requiring little computational time in its calculations, being adapted to the \textit{HEALPix} pixelization scheme, and offering the possibility of multi-resolution analysis. The decomposition is implemented as part of a fully internal template fitting method, minimizing the variance of the resulting map at each scale. Using a χ2\chi^2 characterization of the noise, we find that the residuals of the cleaned maps are compatible with those expected from the instrumental noise. The maps are also comparable to those obtained from the \textit{WMAP} team, but in our case we do not make use of external data sets. In addition, at low resolution, our cleaned maps present a lower level of noise. The E-mode power spectrum CEEC_{\ell}^{EE} is computed at high and low resolution; and a cross power spectrum CTEC_{\ell}^{TE} is also calculated from the foreground reduced maps of temperature given by \textit{WMAP} and our cleaned maps of polarization at high resolution. These spectra are consistent with the power spectra supplied by the \textit{WMAP} team. We detect the E-mode acoustic peak at 400\ell \sim 400, as predicted by the standard ΛCDM\Lambda CDM model. The B-mode power spectrum CBBC_{\ell}^{BB} is compatible with zero.Comment: 8 pages, 6 figures. Some changes have been done from the original manuscript. This paper is accepted by MNRA
    corecore