
    DOM-based Content Extraction of HTML Documents

    Web pages often contain clutter around the body of the article as well as distracting features that take away from the true information the user is pursuing. This can range from pop-up ads to flashy banners to unnecessary images and links scattered around the screen. Extraction of 'useful and relevant' content from web pages has many applications, ranging from lightweight environments, like cell phone and PDA browsing, to speech rendering for the visually impaired, to text summarization. Most approaches to removing the clutter or making the content more readable involve either changing the size of the font or simply removing certain HTML-denoted components like images, thus taking away from the webpage's inherent look and feel. Unlike Content Reformatting, which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses Content Extraction. We have developed a framework that employs an easily extensible set of techniques that incorporate the advantages of previous work on content extraction while limiting the disadvantages. Our key insight is to work with the Document Object Model tree (after parsing and correcting the HTML), rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that anyone can use to extract content from HTML web pages for their own purposes.
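
    As a rough illustration of the DOM-based idea described above (not the authors' exact technique), the following Python sketch parses the HTML into a tree with BeautifulSoup, detaches tags that rarely carry article content, and keeps the container with the most non-link text; the tag list and scoring heuristic are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Tags that rarely carry article content (illustrative assumption).
CLUTTER_TAGS = ["script", "style", "nav", "header", "footer", "aside", "iframe", "form"]

def extract_main_text(html: str) -> str:
    # The parser also repairs malformed markup before we walk the tree.
    soup = BeautifulSoup(html, "html.parser")

    # Detach subtrees that are usually clutter rather than content.
    for tag in soup.find_all(CLUTTER_TAGS):
        tag.extract()

    # Keep the container with the most non-link text.
    best, best_score = soup.body or soup, 0.0
    for node in soup.find_all(["div", "article", "section", "td"]):
        text = node.get_text(" ", strip=True)
        link_text = " ".join(a.get_text(" ", strip=True) for a in node.find_all("a"))
        score = len(text) - 2 * len(link_text)  # penalize link-heavy navigation blocks
        if score > best_score:
            best, best_score = node, score

    return best.get_text("\n", strip=True)
```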

    Boilerplate Removal using a Neural Sequence Labeling Model

    The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing. Existing approaches are lacking as they rely on large amounts of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model. Comment: WWW20 Demo paper.
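
    A minimal sequence-labeling sketch in the spirit of the model described above: each HTML tag or word becomes one token and the model predicts content vs. boilerplate per token. The architecture details (embedding size, BiLSTM, binary labels) are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BoilerplateTagger(nn.Module):
    """Label each token of a flattened page as boilerplate (0) or content (1)."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)   # (batch, seq, embed_dim)
        h, _ = self.lstm(x)         # (batch, seq, 2 * hidden)
        return self.out(h)          # per-token logits

# Toy usage: a page flattened into interleaved tag and word tokens.
vocab = {"<html>": 0, "<div>": 1, "<p>": 2, "buy": 3, "now": 4, "the": 5, "article": 6}
tokens = torch.tensor([[0, 1, 3, 4, 2, 5, 6]])
model = BoilerplateTagger(vocab_size=len(vocab))
labels = model(tokens).argmax(dim=-1)  # predicted content mask per token
```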

    Where are your Manners? Sharing Best Community Practices in the Web 2.0

    The Web 2.0 fosters the creation of communities by offering users a wide array of social software tools. While the success of these tools is based on their ability to support different interaction patterns among users by imposing as few limitations as possible, the communities they support are not free of rules (just think about the posting rules in a community forum or the editing rules in a thematic wiki). In this paper we propose a framework for the sharing of best community practices in the form of a (potentially rule-based) annotation layer that can be integrated with existing Web 2.0 community tools (with specific focus on wikis). This solution is characterized by minimal intrusiveness and plays nicely within the open spirit of the Web 2.0 by providing users with behavioral hints rather than by enforcing strict adherence to a set of rules. Comment: ACM Symposium on Applied Computing, Honolulu, United States (2009).
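
    If the annotation layer is realized as rules, one might picture it roughly as in the sketch below; the rule format and the wiki integration are assumptions, not the paper's design. The key point it illustrates is that rules inspect a user action and return non-blocking hints instead of rejecting the action.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    hint: str
    applies: Callable[[Dict], bool]  # predicate over the user action

def hints_for(action: Dict, rules: List[Rule]) -> List[str]:
    """Collect behavioral hints; the action itself is never blocked."""
    return [r.hint for r in rules if r.applies(action)]

# Toy community practices for a wiki edit.
rules = [
    Rule("Please add an edit summary.", lambda a: not a.get("summary")),
    Rule("Large edits are easier to review if split up.", lambda a: len(a.get("text", "")) > 5000),
]

edit = {"page": "Style_guide", "text": "minor typo fix", "summary": ""}
print(hints_for(edit, rules))  # -> ['Please add an edit summary.']
```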

    Information Extraction in Illicit Domains

    Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment. Comment: 10 pages, ACM WWW 2017.
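
    A loose sketch of the seed-supervised, feature-agnostic idea (not the paper's algorithm): each candidate value is represented only by the raw words around it, and a classifier is trained from a handful of seed annotations. The vectorizer and classifier choices below are illustrative assumptions.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def context_window(tokens, i, size=3):
    # Represent a candidate token only by the raw words around it.
    return " ".join(tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size])

# A few seed annotations for one attribute (here a location-like field):
# (tokenized sentence, index of the candidate token, label).
seeds = [
    (["meet", "me", "in", "houston", "tonight", "call"], 3, 1),
    (["new", "in", "atlanta", "this", "week"], 2, 1),
    (["call", "me", "at", "5551234", "anytime"], 3, 0),
    (["age", "23", "available", "now"], 1, 0),
]

vec = HashingVectorizer(n_features=2 ** 12, alternate_sign=False)
X = vec.transform([context_window(t, i) for t, i, _ in seeds])
y = [label for _, _, label in seeds]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Score an unseen candidate token by its context alone.
tokens = ["visiting", "dallas", "for", "two", "days"]
print(clf.predict_proba(vec.transform([context_window(tokens, 1)]))[0, 1])
```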

    The Designing of a Web Page Recommendation System for ESL

    In this paper, a webpage reading recommendation system is constructed using the concept of meta search and an article summary technique. The designed system recommends webpages that are related to the current webpage, to provide the user with further reading material. Using the article-searching mechanism, the ESL student can avoid the keyword-based search method, thereby greatly decreasing the time spent looking for related articles. The system provides related articles as well as information such as the difficulty of the articles, which assists English learning and fosters a more user-friendly English learning environment. This in turn increases learning efficiency. A designed toolbar serves as the main medium of communication with the user. All the user has to do is install the toolbar on the browser to gain assistance from the system. Conference location: Niigata, Japan (international conference).
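
    A loose sketch of the recommendation idea (not the system described in the paper): build a query from the most frequent content words of the current page, then rank candidate articles by word overlap and a simple difficulty proxy. The candidate library and the difficulty heuristic are illustrative assumptions, and the meta-search step is replaced here by a local lookup.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "for", "on", "that"}

def keywords(text, k=5):
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def difficulty(text):
    # Crude readability proxy: longer sentences and longer words read harder.
    sentences = max(1, len(re.findall(r"[.!?]", text)))
    words = re.findall(r"[a-z]+", text.lower())
    return len(words) / sentences + 10 * sum(map(len, words)) / max(1, len(words))

def recommend(current_page, candidates, top_n=2):
    query = set(keywords(current_page))
    scored = [(len(query & set(keywords(body, k=20))), title, round(difficulty(body), 1))
              for title, body in candidates.items()]
    return sorted(scored, reverse=True)[:top_n]

# Toy usage: suggest further reading for the page the learner is viewing.
page = "Pandas are bears native to China. The panda diet is mostly bamboo."
library = {
    "Bamboo forests": "Bamboo grows quickly and feeds pandas in China.",
    "Stock markets": "Shares and bonds are traded on financial exchanges.",
}
print(recommend(page, library))  # highest-overlap article first, with difficulty score
```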