1,004 research outputs found

A Benchmark Suite for Template Detection and Content Extraction

Authors: Alarte, Julián; Silva, Josep
Publication date: 30/11/2020

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. Both analyse the structure and content of webpages to extract some part of the document, but their objectives differ. Template detection identifies the template of a webpage (usually by comparing it with other webpages of the same website), whereas content extraction identifies the main content of the webpage and discards the rest. The two are therefore complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of data on the Web. Identifying templates is thus essential for indexing tasks, because templates usually contain irrelevant information such as advertisements, menus and banners, and processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real heterogeneous webpages that have been labelled so that different techniques can be suitably (and automatically) compared.

Comment: 13 pages, 3 tables

arXiv.org e-Print Archive

An Analysis of the Current Program Slicing and Algorithmic Debugging Based Techniques

Author: Silva Galiana Josep Francesc
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 10/01/2012

This thesis presents a classification of techniques based on program slicing. The classification allows us to identify the differences between existing techniques, and it also allows us to predict new slicing techniques. The study identifies and compares the dimensions that influence current techniques.

Silva Galiana, JF. (2008). An Analysis of the Current Program Slicing and Algorithmic Debugging Based Techniques. http://hdl.handle.net/10251/14300

RiuNet

Page-Level Main Content Extraction from Heterogeneous Webpages

Authors: Alarte, Julián; Silva, Josep
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/12/2021

The main content of a webpage is often surrounded by boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. In addition, detecting and extracting the main content is useful in different areas, such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a page-level technique based on the Document Object Model, so it needs to load only a single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique using a suite of real heterogeneous benchmarks, producing very good results compared with other well-known content extraction techniques.

This work has been partially supported by the EU (FEDER) and the Spanish MCI/AEI under grants TIN2016-76843-C4-1-R and PID2019-104735RB-C41, by the Generalitat Valenciana under grant Prometeo/2019/098 (DeepTrust), and by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No 952215.

Alarte, J.; Silva, J. (2021). Page-Level Main Content Extraction from Heterogeneous Webpages. ACM Transactions on Knowledge Discovery from Data. 15(6):1-21. https://doi.org/10.1145/3451168

RiuNet

ADJUSTING CORRELATION MATRICES

Authors: Begoña Subiza; Josep Enric Peris Ferrando; José Angel Silva; Ángel León

The article proposes a new algorithm for adjusting correlation matrices and compares it with Finger's algorithm, which is used to compute Value-at-Risk in RiskMetrics for stress test scenarios. The solution proposed by the new methodology is always better than Finger's approach, in the sense that it alters as little as possible those correlations that we do not wish to alter but that change in order to obtain a consistent Finger correlation matrix.

Keywords: Stochastic, Volatility, Skewness, Kurtosis, Pricing.

Research Papers in Economics

Materiales y recursos digitales (Digital materials and resources)

Author: Silva i Galán Josep M.
Publication date: 01/01/2011

The supply of digital educational tools, materials, and content is unlimited, and it forces teachers to search for, select, and store them in a way very different from what they were used to. What resources are available to them? How can they organize them? And share them

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Diposit Digital de Documents de la UAB