Web Content Extraction Techniques: A survey
As technology grows every day and the amount of research done in various fields rises exponentially, the amount of information published on the World Wide Web rises in a similar fashion. Along with this rise in useful information, an increasing amount of excess, irrelevant information, termed 'noise', is also published in the form of advertisements, links, scrollers, etc. Systems are therefore being developed for data pre-processing and cleaning for real-time applications. These systems also help other analysis systems, such as social network mining, web mining, and data mining, to analyze data in real time, and support special tasks such as false advertisement detection, demand forecasting, and comment extraction from product and service reviews. For the web content extraction task, researchers have proposed many different methods, such as wrapper-based methods, DOM tree rule-based methods, machine learning-based methods, and so on. This paper presents a comparative study of four recently proposed methods for web content extraction. These methods use the traditional DOM tree rule-based method as a base and build on it with other tools to produce better results.
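To make the DOM tree rule-based family of methods concrete, the following minimal sketch removes noisy subtrees from a parsed page before the remaining text is treated as candidate content. It assumes the BeautifulSoup library is available; the tag list and class/id hints are illustrative assumptions, not rules taken from any of the surveyed papers.

from bs4 import BeautifulSoup

# Illustrative, assumed noise rules; real rule-based extractors tune these per corpus.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "iframe"}
NOISE_HINTS = ("advert", "banner", "scroller", "sidebar", "promo")

def extract_main_text(html: str) -> str:
    """Strip noisy subtrees from the DOM and return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        names = " ".join(tag.get("class", []) + [tag.get("id") or ""]).lower()
        if tag.name in NOISE_TAGS or any(hint in names for hint in NOISE_HINTS):
            tag.extract()  # detach the whole noisy subtree from the DOM
    return soup.get_text(separator="\n", strip=True)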
Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future
In this paper, we present a meta-analysis of several Web content extraction algorithms, and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors do not consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: heuristic content extractor performance also tends to degrade over time due to the evolution of Web site forms and practices. We conclude with recommendations for future work that address these and other findings.
Comment: Accepted for publication in SIGKDD Explorations.
An Experimental Study to Identify the Impact of Expert Filter Tokens for Syllabus Based Searches
People search using search engines like Google. While specifying the query, other words (filter tokens) are used along with the topic being searched; these help filter the results so that they are relevant and effective. SBWCE is being developed to make Syllabus Based Web Content Extraction easy and automatic. As SBWCE relies on Expert Filter Tokens, there was a need to study their impact. Therefore, a framework to study the impact of these expert filter tokens was applied to 'Computer Organization and Assembly Language' as an experimental study. This paper presents the related details.
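As an illustration of the general idea, the sketch below builds filtered search queries by appending expert filter tokens to a syllabus topic. The token list is a hypothetical example, not the token set actually used by SBWCE.

# Hypothetical expert filter tokens; SBWCE's actual tokens are elicited from domain experts.
EXPERT_FILTER_TOKENS = ["lecture notes", "tutorial", "solved examples"]

def build_queries(topic: str):
    """Yield the bare topic query plus one filtered query per expert token."""
    yield topic
    for token in EXPERT_FILTER_TOKENS:
        yield f'{topic} "{token}"'

for query in build_queries("Computer Organization and Assembly Language"):
    print(query)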
Automatic Web Content Extraction by Combination of Learning and Grouping
Web pages consist of not only actual content, but also other elements such as branding banners, navigational elements, advertisements, copyright notices, etc. This noisy content is typically not related to the main subjects of the web pages. Identifying the actual content, or clipping web pages, has many applications, such as high quality web printing, e-reading on mobile devices, and data mining. Although there are many existing methods attempting to address this task, most of them either work only on certain types of Web pages, e.g. article pages, or have to develop different models for different websites. We formulate the actual content identification problem as a DOM tree node selection problem. We develop multiple features by utilizing DOM tree node properties to train a machine learning model. Then candidate nodes are selected based on the learned model. Based on the observation that the actual content is usually located in a spatially continuous block, we develop a grouping technique to further filter out noisy data and pick up missing data for the candidate nodes. We conduct extensive experiments on a real dataset and demonstrate that our solution produces high quality outputs and outperforms several baseline methods.
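The sketch below illustrates the kind of per-node features a DOM tree node selection approach can feed to a classifier (text length, link density, depth, fan-out). The specific feature set, and the idea of scoring nodes with a trained classifier before a grouping pass, are assumptions for illustration only, not the authors' actual model; BeautifulSoup is assumed as the DOM parser.

from bs4 import BeautifulSoup

def node_features(tag):
    """Simple structural/textual features for one DOM element node."""
    text = tag.get_text(" ", strip=True)
    link_text = " ".join(a.get_text(" ", strip=True) for a in tag.find_all("a"))
    return {
        "text_length": len(text),
        "link_density": len(link_text) / len(text) if text else 0.0,
        "depth": len(list(tag.parents)),
        "n_child_elements": len(tag.find_all(recursive=False)),
    }

def candidate_feature_vectors(html: str):
    """Feature dicts for every element node; a trained classifier would score
    these as content vs. noise, and a grouping pass would then keep the largest
    spatially contiguous block of positively scored nodes."""
    soup = BeautifulSoup(html, "html.parser")
    return [(tag.name, node_features(tag)) for tag in soup.find_all(True)]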
The Syllabus Based Web Content Extractor (SBWCE)
The Syllabus Based Web Content Extractor (SBWCE) introduces a new technique of Syllabus Based Web Content Mining. It makes Syllabus Based Web Content Extraction easy and creates an instant online book view based on the links relevant to the given syllabus. The current work makes three important contributions. First, as educational information in multiple formats is needed for syllabus-based content, the technique used makes finding such content easier. Second, a new approach for capturing and recording the heuristics experts use while searching is applied. Third, the grouping of syllabus words is exploited for precise extraction. This paper introduces SBWCE and presents the related details.
Website Content Extraction Using Web Structure Analysis
The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results. This project takes a domain-oriented approach to Web data extraction and discusses its application to extracting news from Web sites. It uses an abstraction method to identify important sections in a web document. The relevant information is taken into account and highlighted in order to produce a focused web content output. The fact-finding and data about the project were gathered from various sources such as the Internet and books. The methodology used is the Waterfall Model, which involves several phases: Planning, Analysis, Design, and Implementation. The result of this project is a display and review of web content extraction and how it is currently being developed, with the goal of giving more usability and ease of use to web users.
Boilerplate Removal using a Neural Sequence Labeling Model
The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing. Existing approaches are lacking as they rely on large amounts of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model.
Comment: WWW20 Demo paper.
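To illustrate the input representation such a model consumes, the sketch below linearizes a page into a flat sequence of HTML tags and words using Python's standard html.parser. The tokenizer is an assumed simplification, and the tagging network itself (indicated only in a comment, e.g. embeddings feeding a BiLSTM) is a generic sequence labeler, not the authors' implementation.

from html.parser import HTMLParser

class HTMLTokenizer(HTMLParser):
    """Linearize a page into a flat sequence of tag tokens and word tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.extend(data.split())

def tokenize(html: str):
    parser = HTMLTokenizer()
    parser.feed(html)
    return parser.tokens  # e.g. ['<div>', 'Main', 'article', 'text', '</div>']

# A neural tagger (e.g. token embeddings feeding a BiLSTM with a per-token
# softmax over {content, boilerplate}) would then label each token, and the
# tokens predicted as content would be stitched back together as the main text.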