Automating Content Extraction of HTML Documents
Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts the user from the actual content. Extraction of 'useful and relevant' content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to making content more readable involve changing the font size or removing HTML and data components such as images, which takes away from a webpage's inherent look and feel. Unlike 'Content Reformatting', which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses 'Content Extraction'. We have developed a framework that employs an easily extensible set of techniques and incorporates the advantages of previous work on content extraction. Our key insight is to work with DOM trees, a W3C-specified interface that allows programs to dynamically access document structure, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that extracts content from HTML web pages. This proxy can be administered centrally for groups of users as well as used by individuals in personal browsers. After receiving feedback from users about the proxy, we also created a revised version designed with improved performance and accessibility in mind.
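The DOM-based idea described in this abstract can be sketched in a few lines: parse the page into its tag structure and keep only text that lies outside subtrees assumed to be clutter. This is a minimal illustration using Python's standard library, not the authors' actual proxy; the set of clutter tags is an assumption for the example, whereas the original framework uses a richer, extensible rule set.

```python
from html.parser import HTMLParser

# Container tags assumed to hold clutter rather than article content
# (an illustrative choice, not the framework's real rule set).
CLUTTER_TAGS = {"script", "style", "nav", "aside", "header", "footer"}

class ContentExtractor(HTMLParser):
    """Walks the parsed tag structure and keeps text outside clutter subtrees."""
    def __init__(self):
        super().__init__()
        self.depth_in_clutter = 0  # >0 while inside a clutter subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in CLUTTER_TAGS:
            self.depth_in_clutter += 1

    def handle_endtag(self, tag):
        if tag in CLUTTER_TAGS and self.depth_in_clutter > 0:
            self.depth_in_clutter -= 1

    def handle_data(self, data):
        if self.depth_in_clutter == 0 and data.strip():
            self.chunks.append(data.strip())

page = ("<html><body><nav>Home | About</nav>"
        "<p>Actual article text.</p>"
        "<script>track();</script></body></html>")
parser = ContentExtractor()
parser.feed(page)
print(" ".join(parser.chunks))  # -> Actual article text.
```

Working at this structural level, rather than on raw HTML strings, is what lets such a filter generalize across page layouts.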
Self-supervised automated wrapper generation for weblog data extraction
Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but they remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach to deriving the rules and automating wrapper generation. An evaluation of the model conducted on a dataset of 2,393 posts achieved 92% accuracy, showing that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
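The core matching step this abstract describes, aligning a known feed value with the HTML element that renders it to derive an extraction rule, can be sketched as follows. This is a deterministic toy version: the paper's probabilistic aggregation over multiple posts is omitted, and the sample markup and path notation are assumptions for illustration.

```python
from html.parser import HTMLParser

class RuleFinder(HTMLParser):
    """Records the tag path of every text node, so that feed values can be
    matched against HTML elements to derive a candidate extraction rule."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_paths = []  # list of (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.text_paths.append(("/".join(self.stack), data.strip()))

def derive_rule(feed_value, html_source):
    """Return the tag path whose text matches the feed value; over many
    posts, recurring paths would become high-confidence extraction rules."""
    finder = RuleFinder()
    finder.feed(html_source)
    for path, text in finder.text_paths:
        if text == feed_value:
            return path
    return None

post = "<html><body><div><h1 class='t'>My Post Title</h1><p>Body...</p></div></body></html>"
print(derive_rule("My Post Title", post))  # -> html/body/div/h1
```

Repeating this match over many posts and keeping the paths that recur is what turns per-post matches into a reusable wrapper.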
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
Automating Metadata Extraction: Genre Classification
A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge the discrepancy, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text can be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF [22]. Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built on looking at documents from five directions: as an object of specific visual format, a layout of strings with characteristic grammar, an object with stylometric signatures, an object with meaning and purpose, and an object linked to previously classified objects and external sources. Some results of experiments relating to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.
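The "stylometric signatures" direction mentioned above can be approximated with simple surface statistics. The following is a toy sketch of such a feature extractor; the specific features chosen here are assumptions for illustration, not the DCC's actual feature set.

```python
import re

def stylometric_features(text):
    """Toy stylometric signature: a few surface statistics of the kind a
    genre classifier might consume (real systems use richer feature sets)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

feats = stylometric_features("We measured the effect. The effect was small.")
print(feats["avg_sentence_len"])  # -> 4.0
```

Genres such as journal papers versus emails tend to separate on exactly these kinds of statistics, which is why they make a plausible first classification signal.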
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.
This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow the gathering of large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, offering unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains.
Comment: Knowledge-based System
Hi, how can I help you?: Automating enterprise IT support help desks
Question answering is one of the primary challenges of natural language understanding. In realizing such a system, providing complex, long answers to questions is a more challenging task than factoid answering, as the former needs context disambiguation. The different methods explored in the literature can be broadly classified into three categories: 1) classification based, 2) knowledge graph based and 3) retrieval based. Individually, none of them addresses the need for an enterprise-wide assistance system for an IT support and maintenance domain. In this domain the variance of answers is large, ranging from factoids to structured operating procedures, and the knowledge is spread across heterogeneous data sources such as application-specific documentation and ticket management systems, so no single technique for general-purpose assistance can scale across such a landscape. To address this, we have built a cognitive platform with capabilities adapted for this domain. Further, we have built a general-purpose question answering system leveraging the platform that can be instantiated for multiple products and technologies in the support domain. The system uses a novel hybrid answering model that orchestrates across a deep learning classifier, a knowledge graph based context disambiguation module and a sophisticated bag-of-words search system. This orchestration performs context switching for a given question and also hands the question off smoothly to a human expert if none of the automated techniques can provide a confident answer. This system has been deployed across 675 internal enterprise IT support and maintenance projects.
Comment: To appear in IAAI 201
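The orchestration pattern this abstract describes, scoring candidate answers from several subsystems and handing off to a human when nothing is confident, can be sketched as follows. All function names, answers, scores and the threshold here are hypothetical stand-ins, not the deployed system's components.

```python
# Hypothetical confidence-based orchestration across several answerers,
# with a hand-off to a human expert when no candidate is confident enough.
CONFIDENCE_THRESHOLD = 0.7  # illustrative value, not from the paper

def classifier_answer(question):   # stand-in for the deep learning classifier
    return ("Restart the service.", 0.4)

def kg_answer(question):           # stand-in for knowledge-graph disambiguation
    return ("See runbook section 3.", 0.9)

def search_answer(question):       # stand-in for the bag-of-words search system
    return ("Top document snippet...", 0.6)

def orchestrate(question):
    """Pick the highest-confidence candidate; escalate if all are weak."""
    candidates = [f(question) for f in (classifier_answer, kg_answer, search_answer)]
    answer, score = max(candidates, key=lambda c: c[1])
    if score >= CONFIDENCE_THRESHOLD:
        return answer
    return "Escalating to a human expert."

print(orchestrate("How do I reset my VPN token?"))  # -> See runbook section 3.
```

The human hand-off branch is what keeps low-confidence answers from ever reaching the user, which matters in a support setting where a wrong procedure is costlier than a delay.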
Information extraction from Webpages based on DOM distances
Retrieving information from the Internet is a difficult task, as demonstrated by the lack of real-time tools able to extract information from webpages. The main cause is that most webpages on the Internet are implemented in plain (X)HTML, a language that lacks structured semantic information. For this reason, much of the effort in this area has been directed to the development of techniques for URL extraction, a field that has produced good results implemented by modern search engines. In contrast, extracting information from a single webpage has produced poor results or very limited tools. In this work we define a novel technique for information extraction from single webpages or collections of interconnected webpages. The technique is based on DOM distances to retrieve information, which allows it to work with any webpage and thus to retrieve information online. Our implementation and experiments demonstrate the usefulness of the technique.
Castillo, C.; Valero Llinares, H.; Guadalupe Ramos, J.; Silva Galiana, JF. (2012). Information extraction from Webpages based on DOM distances. In: Computational Linguistics and Intelligent Text Processing. Springer Verlag (Germany), pp. 181-193. doi:10.1007/978-3-642-28601-8_16
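A minimal notion of the "DOM distance" this abstract relies on is the number of edges between two nodes of the document tree, measured through their lowest common ancestor. The sketch below shows only that metric on a toy parent map; the paper builds a full extraction technique on top of such distances, which is not reproduced here.

```python
def ancestors(parent, node):
    """Chain from a node up to the root, inclusive."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def dom_distance(parent, a, b):
    """Edges between a and b through their lowest common ancestor."""
    pa = ancestors(parent, a)
    pb = ancestors(parent, b)
    pb_set = set(pb)
    common = next(n for n in pa if n in pb_set)
    return pa.index(common) + pb.index(common)

# Toy DOM: html -> body -> (div1 -> p1, div2 -> p2)
parent = {"body": "html", "div1": "body", "p1": "div1",
          "div2": "body", "p2": "div2"}
print(dom_distance(parent, "p1", "p2"))  # -> 4 (p1-div1-body-div2-p2)
```

Nodes that render close together in a page tend to be close in the tree, so a small DOM distance is a cheap, markup-independent proxy for topical relatedness.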