Web Content Mining for Information on Information Scientists
This paper presents a search system for information on scientists, implemented as a prototype for the area of information science and employing Web Content Mining techniques. The sources used in the implemented approach are online publication services and personal homepages of scientists. The system contains wrappers for querying the publication services and extracting information from their result pages, as well as methods for information extraction from homepages, which are based on heuristics concerning the structure and composition of the pages. Moreover, a specialised technique for finding personal homepages of information scientists was developed.
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential of cross-fertilization, i.e., the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.
Comment: Knowledge-based System
Template Mining for Information Extraction from Digital Documents
published or submitted for publication
Deliverable D2.6 LinkedTV Framework for Generating Video Enrichments with Annotations
This deliverable describes the final LinkedTV framework that provides a set of possible enrichment resources for seed video content using techniques such as text and web mining, information extraction and information retrieval. The enrichment content is obtained from four types of sources: a) by crawling and indexing web sites described in a white list specified by the content partners, b) by querying the publicly exposed API or SPARQL endpoint of the Europeana digital library network, c) by querying multiple social networking APIs, d) by hyperlinking to other parts of TV programs within the same collection using a Solr index. This deliverable also describes an additional content annotation functionality, namely labelling enrichment (as well as seed) content with thematic topics, as well as the process of exposing content annotations to this module and to the filtering services of LinkedTV’s personalization workflow. We illustrate the enrichment workflow for the two main scenarios of LinkedTV, which have led to the development of the LinkedCulture and LinkedNews applications, which respectively use the TVEnricher and TVNewsEnricher enrichment services. The original title of this deliverable from the DoW was "Advanced concept labelling by complementary Web mining".