1,004 research outputs found
A Benchmark Suite for Template Detection and Content Extraction
Template detection and content extraction are two of the main areas of
information retrieval applied to the Web. They perform different analyses over
the structure and content of webpages to extract some part of the document.
However, their objective is different. While template detection identifies the
template of a webpage (usually comparing with other webpages of the same
website), content extraction identifies the main content of the webpage,
discarding the rest. Therefore, they are somewhat complementary, because
the main content is not part of the template. It has been measured that
templates represent between 40% and 50% of data on the Web. Therefore,
identifying templates is essential for indexing tasks because templates usually
contain irrelevant information such as advertisements, menus and banners.
Processing and storing this information is likely to lead to a waste of
resources (storage space, bandwidth, etc.). Similarly, identifying the main
content is essential for many information retrieval tasks. In this paper, we
present a benchmark suite to test different approaches for template detection
and content extraction. The suite is public, and it contains real heterogeneous
webpages that have been labelled so that different techniques can be suitably
(and automatically) compared.
Comment: 13 pages, 3 table
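As a rough illustration of how labelled webpages allow techniques to be compared automatically, the sketch below scores the nodes a technique retrieves against the labelled ones using precision, recall, and F1. The function name `score_extraction` and the node ids are hypothetical, and the abstract does not state which metrics the suite itself uses.

```python
def score_extraction(retrieved, gold):
    """Precision, recall and F1 of retrieved node ids against labelled ones."""
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1


# Hypothetical labelled page: node ids 1-4 form the main content.
gold_main_content = {1, 2, 3, 4}
# Hypothetical output of a technique: three correct nodes plus one menu node.
retrieved = {2, 3, 4, 9}
print(score_extraction(retrieved, gold_main_content))  # (0.75, 0.75, 0.75)
```

The same function would apply to template detection by passing the labelled template nodes as `gold`.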
An Analysis of the Current Program Slicing and Algorithmic Debugging Based Techniques
This thesis presents a classification of program slicing based techniques. The classification allows us to identify the differences between existing techniques, but it also allows us to predict new slicing techniques. The study identifies and compares the dimensions that influence current techniques.
Silva Galiana, JF. (2008). An Analysis of the Current Program Slicing and Algorithmic Debugging Based Techniques. http://hdl.handle.net/10251/14300
Page-Level Main Content Extraction from Heterogeneous Webpages
The main content of a webpage is often surrounded by other boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. Besides, the detection and extraction of the main content is useful in different areas, such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, this technique extracts not only text but also other types of content, such as images and animations. It is a Document Object Model-based page-level technique, so it only needs to load a single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique using a suite of real heterogeneous benchmarks, producing very good results compared with other well-known content extraction techniques.
This work has been partially supported by the EU (FEDER) and the Spanish MCI/AEI under grants TIN2016-76843-C4-1-R and PID2019-104735RB-C41, by the Generalitat Valenciana under grant Prometeo/2019/098 (DeepTrust), and by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.
Alarte, J.; Silva, J. (2021). Page-Level Main Content Extraction from Heterogeneous Webpages. ACM Transactions on Knowledge Discovery from Data. 15(6):1-21. https://doi.org/10.1145/3451168
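To make the idea of a page-level, structure-based extraction concrete, here is a minimal sketch that inspects one single webpage and picks the element whose direct paragraphs hold the most non-link text. This is a simple stand-in heuristic chosen for illustration, not the technique proposed in the article; the names `ParagraphScorer` and `main_content_container` and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ParagraphScorer(HTMLParser):
    """Credits each element with the non-link text of its direct <p> children."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, id attribute, serial number)
        self.scores = {}   # parent of a <p> -> characters of paragraph text
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.count += 1
            self.stack.append((tag, dict(attrs).get("id"), self.count))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        tags = [entry[0] for entry in self.stack]
        if "p" in tags and "a" not in tags:
            idx = len(tags) - 1 - tags[::-1].index("p")  # innermost open <p>
            if idx > 0:
                parent = self.stack[idx - 1]
                self.scores[parent] = self.scores.get(parent, 0) + len(data.strip())


def main_content_container(html):
    """Return (tag, id) of the best-scoring element, or None if no text was found."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    if not scorer.scores:
        return None
    return max(scorer.scores, key=scorer.scores.get)[:2]


page = """<html><body>
<div id="menu"><p><a href="/">Home</a></p><p><a href="/news">News</a></p></div>
<div id="content"><p>First paragraph of the article body.</p>
<p>Second paragraph with <a href="/x">a link</a> and more text.</p><img src="f.png"></div>
<div id="footer"><p>Copyright notice</p></div>
</body></html>"""
print(main_content_container(page))
```

Because the sketch returns a container element rather than a string, non-textual content inside that container (such as the image in the sample page) would be kept along with the text.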
ADJUSTING CORRELATION MATRICES
The article proposes a new algorithm for adjusting correlation matrices and compares it with Finger's algorithm, which is used to compute Value-at-Risk in RiskMetrics for stress test scenarios. The solution proposed by the new methodology is always better than Finger's approach in the sense that it alters as little as possible those correlations that we do not wish to alter but that change in order to obtain a consistent Finger correlation matrix.
Keywords: Stochastic, Volatility, Skewness, Kurtosis, Pricing.
Digital Materials and Resources
The supply of digital educational tools, materials, and content is unlimited, and it forces teachers to search for, select, and store them in a way very different from what they were used to. What resources do they have available? How can they organize them? And share them