23,688 research outputs found

    Automatic Web Content Extraction by Combination of Learning and Grouping

    Full text link
    Web pages consist of not only actual content, but also other ele-ments such as branding banners, navigational elements, advertise-ments, copyright etc. This noisy content is typically not related to the main subjects of the webpages. Identifying the part of ac-tual content, or clipping web pages, has many applications, such as high quality web printing, e-reading on mobile devices and data mining. Although there are many existing methods attempting to address this task, most of them can either work only on certain types of Web pages, e.g. article pages, or has to develop differ-ent models for different websites. We formulate the actual content identifying problem as a DOM tree node selection problem. We develop multiple features by utilizing the DOM tree node proper-ties to train a machine learning model. Then candidate nodes are selected based on the learning model. Based on the observation that the actual content is usually located in a spatially continuous block, we develop a grouping technology to further filter out noisy data and pick missing data for the candidate nodes. We conduct ex-tensive experiments on a real dataset and demonstrate our solution has high quality outputs and outperforms several baseline methods

    Exploiting the user interaction context for automatic task detection

    Get PDF
    Detecting the task a user is performing on her computer desktop is important for providing her with contextualized and personalized support. Some recent approaches propose to perform automatic user task detection by means of classifiers using captured user context data. In this paper we improve on that by using an ontology-based user interaction context model that can be automatically populated by (i) capturing simple user interaction events on the computer desktop and (ii) applying rule-based and information extraction mechanisms. We present evaluation results from a large user study we have carried out in a knowledge-intensive business environment, showing that our ontology-based approach provides new contextual features yielding good task detection performance. We also argue that good results can be achieved by training task classifiers `online' on user context data gathered in laboratory settings. Finally, we isolate a combination of contextual features that present a significantly better discriminative power than classical ones

    Boilerplate Removal using a Neural Sequence Labeling Model

    Full text link
    The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing. Existing approaches are lacking as they rely on large amounts of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack in generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model.Comment: WWW20 Demo pape

    Digital Image Access & Retrieval

    Get PDF
    The 33th Annual Clinic on Library Applications of Data Processing, held at the University of Illinois at Urbana-Champaign in March of 1996, addressed the theme of "Digital Image Access & Retrieval." The papers from this conference cover a wide range of topics concerning digital imaging technology for visual resource collections. Papers covered three general areas: (1) systems, planning, and implementation; (2) automatic and semi-automatic indexing; and (3) preservation with the bulk of the conference focusing on indexing and retrieval.published or submitted for publicatio
    corecore