
    UniversityIE: Information Extraction From University Web Pages

    The amount of information available on the web is growing constantly. As a result, the problem of retrieving any desired information is getting more difficult by the day. To alleviate this problem, several techniques are currently being used, both for locating pages of interest and for extracting meaningful information from the retrieved pages. Information extraction (IE) is one such technology, used for summarizing unrestricted natural language text into a structured set of facts. IE is already being applied within several domains such as news transcripts, insurance information, and weather reports. Various approaches to IE have been taken and a number of significant results have been reported. In this thesis, we describe the application of IE techniques to the domain of university web pages. This domain is broader than previously evaluated domains and has a variety of idiosyncratic problems to address. We present an analysis of the domain of university web pages and the consequences of using them as input to IE systems. We then present UniversityIE, a system that can search a web site, extract relevant pages, and process them for information such as admission requirements or general information. The UniversityIE system, developed as part of this research, contributes three IE methods and a web-crawling heuristic that worked relatively well and predictably over a test set of university web sites. We designed UniversityIE as a generic framework for plugging in and executing IE methods over pages acquired from the web. We also integrated into the system a generic web crawler (built at the University of Kentucky and ported to Java), an external word lexicon (WordNet), and a syntax parser (the Link Grammar Parser).
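
    A minimal sketch of the plug-in idea described above, written in Python rather than the Java used in the thesis; the class and function names (Page, IEMethod, AdmissionRequirementsIE, run_pipeline) are illustrative assumptions, not taken from the UniversityIE code base.

# Minimal sketch of a plug-in style IE framework, loosely modelled on the
# description above; names are illustrative, not from UniversityIE itself.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    html: str
    facts: dict = field(default_factory=dict)


class IEMethod(ABC):
    """One pluggable extraction method (e.g. admission requirements)."""

    @abstractmethod
    def extract(self, page: Page) -> dict:
        ...


class AdmissionRequirementsIE(IEMethod):
    KEYWORDS = ("admission", "requirements", "apply")

    def extract(self, page: Page) -> dict:
        # Naive keyword heuristic standing in for a real IE method.
        hits = [kw for kw in self.KEYWORDS if kw in page.html.lower()]
        return {"admission_keywords_found": hits}


def run_pipeline(pages, methods):
    """Run every registered IE method over every acquired page."""
    for page in pages:
        for method in methods:
            page.facts.update(method.extract(page))
    return pages


if __name__ == "__main__":
    pages = [Page("https://example.edu/admissions", "<h1>Admission requirements</h1> ...")]
    for p in run_pipeline(pages, [AdmissionRequirementsIE()]):
        print(p.url, p.facts)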

    Extraction of information from web pages

    This thesis deals with the extraction of information from the web pages of selected geolocation services. It summarizes the methods used for the geographical localization of network devices and the amount of data provided by selected freely accessible geolocation databases, focusing on the ways of obtaining information about IP addresses from the APIs of the individual databases. The thesis also presents the approach used to develop a system for automatically estimating the geographic location of IP addresses read from a source file and for comparing the retrieved data with reference data. The system, written in Python, provides a simple way to verify information about given IP addresses across five freely accessible geolocation databases. Furthermore, the accuracy of the retrieved data is evaluated and the five geolocation databases are compared.
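
    A minimal sketch of the verification step described above, under the assumption that a freely accessible database exposes a JSON API returning latitude/longitude for an IP address; ip-api.com is used here only as an example endpoint, the thesis's system queries five such services, and the reference coordinates below are made up for illustration.

# Query a free geolocation database for an IP address and compare the
# returned coordinates with reference data (requires network access).
import json
import math
import urllib.request


def query_geolocation(ip: str) -> tuple[float, float]:
    # ip-api.com returns JSON with "lat"/"lon" fields for an IP address;
    # other services differ and would need their own wrappers.
    with urllib.request.urlopen(f"http://ip-api.com/json/{ip}") as resp:
        data = json.load(resp)
    return data["lat"], data["lon"]


def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


if __name__ == "__main__":
    reference = {"8.8.8.8": (37.40, -122.07)}  # made-up reference coordinates
    for ip, ref in reference.items():
        estimate = query_geolocation(ip)
        print(ip, f"error = {haversine_km(estimate, ref):.1f} km")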

    Web Content Mining for Information on Information Scientists

    This paper presents a search system for information on scientists, implemented prototypically for the area of information science using Web Content Mining techniques. The sources used in the implemented approach are online publication services and personal homepages of scientists. The system contains wrappers for querying the publication services and extracting information from their result pages, as well as methods for information extraction from homepages, which are based on heuristics concerning the structure and composition of the pages. Moreover, a specialised search technique for finding personal homepages of information scientists was developed.
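
    A rough sketch of the kind of composition heuristics mentioned above, applied to a personal homepage; the regular expressions and field names are illustrative assumptions, not the rules used in the paper.

# Pull obvious contact details out of a personal homepage with simple
# heuristics (title tag, obfuscated e-mail addresses, phone numbers).
import re

EMAIL_RE = re.compile(r"[\w.+-]+\s*(?:@|\[at\]|\(at\))\s*[\w-]+(?:\.[\w-]+)+", re.I)
PHONE_RE = re.compile(r"\+?\d[\d ()/-]{7,}\d")


def extract_contact_info(html: str) -> dict:
    title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    return {
        "page_title": title.group(1).strip() if title else None,
        "emails": EMAIL_RE.findall(html),
        "phones": PHONE_RE.findall(html),
    }


if __name__ == "__main__":
    sample = "<title>Dr. Jane Doe - Information Science</title> jane.doe [at] example.org"
    print(extract_contact_info(sample))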

    User driven information extraction with LODIE

    Information Extraction (IE) is the technique for transforming unstructured or semi-structured data into a structured representation that can be understood by machines. In this paper we use a user-driven Information Extraction technique to wrap entity-centric Web pages. The user can select concepts and properties of interest from available Linked Data. Given a number of websites containing pages about the concepts of interest, the method exploits (i) recurrent structures in the Web pages and (ii) available knowledge in Linked Data to extract the information of interest from the Web pages.
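
    A minimal sketch of idea (i), recurrent structures: tag paths that repeat across entity-centric pages of the same site are treated as extraction candidates. Mapping those candidates to Linked Data concepts and properties (idea ii) is omitted, and all names here are illustrative rather than taken from LODIE.

# Collect the text found under every tag path in a page, then keep the
# paths that recur across pages as extraction candidates.
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record the text found under every tag path in a page."""

    def __init__(self):
        super().__init__()
        self.stack, self.values = [], {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.values.setdefault("/".join(self.stack), []).append(data.strip())


def recurrent_paths(pages, min_support=2):
    """Return tag paths that appear in at least `min_support` pages."""
    collected = []
    for html in pages:
        p = PathCollector()
        p.feed(html)
        collected.append(p.values)
    counts = Counter(path for values in collected for path in values)
    return {path: [v[path] for v in collected if path in v]
            for path, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    pages = ["<html><body><h1>Alan Turing</h1><span>1912</span></body></html>",
             "<html><body><h1>Ada Lovelace</h1><span>1815</span></body></html>"]
    print(recurrent_paths(pages))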

    Design of Automatically Adaptable Web Wrappers

    Nowadays, the huge amount of information distributed through the Web motivates studying techniques to be adopted in order to extract relevant data in an efficient and reliable way. Both academia and enterprises have developed several approaches to Web data extraction, for example using techniques of artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of the information extracted from Web pages and, at the same time, have to prove robust in order not to compromise the quality and reliability of the data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoiding the failure of data extraction tasks and ensuring the reliability of the extracted information. Our purpose is to evaluate the performance, advantages and drawbacks of our novel system for automatic wrapper adaptation.
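
    A minimal sketch of the adaptation idea under simplified assumptions: when a wrapper's target element disappears after a page change, the most similar element in the new version is located with a generic string-similarity measure (difflib here); the actual similarity algorithms, weights and thresholds of the described system are not reproduced.

# Re-locate a wrapper's target in a new page version by comparing
# (tag path, text) pairs of the old target and the new candidates.
from difflib import SequenceMatcher


def similarity(a: tuple[str, str], b: tuple[str, str]) -> float:
    """Combine path similarity and text similarity of two (path, text) elements."""
    path_sim = SequenceMatcher(None, a[0], b[0]).ratio()
    text_sim = SequenceMatcher(None, a[1], b[1]).ratio()
    return 0.5 * path_sim + 0.5 * text_sim


def adapt_wrapper(old_target, new_elements, threshold=0.6):
    """Return the element of the new page version that best matches the old target."""
    best = max(new_elements, key=lambda el: similarity(old_target, el))
    return best if similarity(old_target, best) >= threshold else None


if __name__ == "__main__":
    old_target = ("html/body/div/span.price", "EUR 19.99")
    new_page = [("html/body/section/h2", "Special offer"),
                ("html/body/section/span.price", "EUR 21.50")]
    print(adapt_wrapper(old_target, new_page))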

    Boilerplate Removal using a Neural Sequence Labeling Model

    The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval and natural language processing. Existing approaches fall short because they rely on large amounts of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model. Comment: WWW '20 demo paper.
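
    A small sketch of the input representation described above: a page is flattened into a sequence of HTML tags and words, which a trained sequence labeling model (not shown here, and not the authors' model) would then tag token by token as main content or boilerplate.

# Flatten an HTML page into the token sequence a sequence labeling
# model would consume: start/end tags and the words between them.
from html.parser import HTMLParser


class SequenceBuilder(HTMLParser):
    """Flatten a page into a sequence of tag and word tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.extend(data.split())


def html_to_sequence(html: str) -> list[str]:
    builder = SequenceBuilder()
    builder.feed(html)
    return builder.tokens


if __name__ == "__main__":
    page = "<nav>Home About</nav><article><p>Main content starts here.</p></article>"
    print(html_to_sequence(page))
    # A trained tagger would label each token, e.g. nav tokens as
    # boilerplate and article tokens as main content.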

    Automated retrieval and extraction of training course information from unstructured web pages

    Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages and there is much need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. There are some commercial, automated web extraction software packages; however, their success comes from heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages, and manually dealing with the evaluation and storage of the extracted results. This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; after much consideration, Naïve Bayes networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) to generate web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web, such as course names, prices, dates and locations. The experimental results indicate that all three aspects of this research perform very well, with the Web Crawler outperforming existing crawling systems, the Web Classifier performing with an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieving an accuracy of over 94% for the extraction of course titles and an accuracy of just under 67% for the extraction of other course attributes such as dates, prices and locations. Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, often no, human assistance.
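
    An illustrative sketch only: hand-written regular expressions of the kind the Genetic Programming component is described as evolving, applied to a course-description snippet. The patterns and example text are assumptions for illustration, not the evolved solutions or data from the thesis.

# Extract course attributes (prices, dates) from unstructured text with
# regular expressions of the sort GP is described as evolving.
import re

PRICE_RE = re.compile(r"(?:£|GBP\s?)\d+(?:\.\d{2})?")
DATE_RE = re.compile(r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|"
                     r"August|September|October|November|December)\s+\d{4}\b")


def extract_course_attributes(text: str) -> dict:
    return {
        "prices": PRICE_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


if __name__ == "__main__":
    snippet = ("Project Management Fundamentals - £495.00, "
               "next session starts 12 March 2025 in Leeds.")
    print(extract_course_attributes(snippet))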