
    Content Extraction based on Hierarchical Relations in DOM Structures

    This article introduces a new approach to content extraction that exploits the hierarchical inter-relations of the elements in a webpage. Content extraction is a technique for extracting the main textual content from a webpage. It is useful for filtering out advertisements and any additional information that is not part of the main content. The main idea behind our approach is to use the DOM tree as an explicit representation of the inter-relations of the elements in a webpage. Using the information contained in the DOM tree, we can identify blocks of content and easily determine which of the blocks contains the most text. Thanks to this information, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information about the related components in a block, thus producing very cohesive blocks.
    López Romero, S.; Silva Galiana, JF.; Insa Cabrera, D. (2012). Content Extraction based on Hierarchical Relations in DOM Structures. Research and Development in Computer Science and Engineering. 45:5-12. http://hdl.handle.net/10251/47738
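The block-selection idea in the abstract above (build the DOM tree, then find the block that holds most of the text) can be illustrated with a minimal sketch. This is not the authors' algorithm: the `Node`, `TreeBuilder` and `main_block` names, the character-count measure, and the `share` threshold are all assumptions made for illustration, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Tags whose text is never readable content (assumption for this sketch).
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are not pushed on the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM element plus the amount of text directly inside it."""

    def __init__(self, tag, parent=None, attrs=None):
        self.tag = tag
        self.parent = parent
        self.attrs = attrs or {}
        self.children = []
        self.own_text = 0  # non-whitespace-trimmed characters owned directly

    def total_text(self):
        """Characters of text in this element and all of its descendants."""
        return self.own_text + sum(c.total_text() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current, dict(attrs))
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.own_text += len(data.strip())


def main_block(root, share=0.6):
    """Descend while a single child holds at least `share` of the text.

    The returned node is the deepest block that still contains most of the
    page text; `share` is an illustrative threshold, not a published value.
    """
    node = root
    while node.children:
        total = node.total_text()
        best = max(node.children, key=Node.total_text)
        if total == 0 or best.total_text() < share * total:
            break
        node = best
    return node


PAGE = """<html><head><title>Demo</title><style>p{margin:0}</style></head><body>
<div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
<div id="article">
<p>Content extraction keeps the main text of a page.</p>
<p>Menus, banners and footers are treated as noise.</p>
<p>The DOM tree groups related elements into blocks.</p>
</div>
<div id="ad">Buy now!</div>
</body></html>"""

builder = TreeBuilder()
builder.feed(PAGE)
block = main_block(builder.root)
print(block.tag, block.attrs.get("id"))
```

The descent stops at the `article` container rather than at one of its paragraphs because no single paragraph holds a dominant share of the text, which is one simple way to keep related elements together in a single block.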

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed for a given domain in other domains.
    Comment: Knowledge-based System

    Chemodiversity of dissolved organic matter in the Amazon Basin

    Regions in the Amazon Basin have been associated with specific biogeochemical processes, but a detailed chemical classification of the abundant and ubiquitous dissolved organic matter (DOM), beyond specific indicator compounds and bulk measurements, has not yet been established. We sampled water from different locations in the Negro, Madeira/Jamari and Tapajós River areas to characterize the molecular DOM composition and distribution. Ultrahigh-resolution Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR-MS) combined with excitation emission matrix (EEM) fluorescence spectroscopy and parallel factor analysis (PARAFAC) revealed a large proportion of ubiquitous DOM but also unique area-specific molecular signatures. Unique to the DOM of the Rio Negro area was the large abundance of high molecular weight, diverse hydrogen-deficient and highly oxidized molecular ions deviating from known lignin or tannin compositions, indicating substantial oxidative processing of these ultimately plant-derived polyphenols indicative of these black waters. In contrast, unique signatures in the Madeira/Jamari area were defined by presumably labile sulfur- and nitrogen-containing molecules in this white water river system. Waters from the Tapajós main stem did not show any substantial unique molecular signatures relative to those present in the Rio Madeira and Rio Negro, which implied a lower organic molecular complexity in this clear water tributary, even after mixing with the main stem of the Amazon River. Besides ubiquitous DOM at average H/C and O/C elemental ratios, a distinct and significant unique DOM pool prevailed in the black, white and clear water areas that were also highly correlated with EEM-PARAFAC components and define the frameworks for primary production and other aspects of aquatic life.