24,652 research outputs found
Finding Person Relations in Image Data of the Internet Archive
The multimedia content in the World Wide Web is rapidly growing and contains
valuable information for many applications in different domains. For this
reason, the Internet Archive initiative has been gathering billions of
time-versioned web pages since the mid-nineties. However, the huge amount of
data is rarely labeled with appropriate metadata and automatic approaches are
required to enable semantic search. Normally, the textual content of the
Internet Archive is used to extract entities and their possible relations
across domains such as politics and entertainment, whereas image and video
content is usually neglected. In this paper, we introduce a system for person
recognition in image content of web news stored in the Internet Archive. Thus,
the system complements entity recognition in text and allows researchers and
analysts to track media coverage and relations of persons more precisely. Based
on a deep learning face recognition approach, we suggest a system that
automatically detects persons of interest and gathers sample material, which is
subsequently used to identify them in the image data of the Internet Archive.
We evaluate the performance of the face recognition system on an appropriate
standard benchmark dataset and demonstrate the feasibility of the approach with
two use cases
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
Concept Extraction Challenge: University of Twente at #MSM2013
Twitter messages are a potentially rich source of continuously and instantly updated information. Shortness and informality of such messages are challenges for Natural Language Processing tasks. In this paper we present a hybrid approach for Named Entity Extraction (NEE) and Classification (NEC) for tweets. The system uses the power of the Conditional Random Fields (CRF) and the Support Vector Machines (SVM) in a hybrid way to achieve better results. For named entity type classification we used AIDA \cite{YosefHBSW11} disambiguation system to disambiguate the extracted named entities and hence find their type
- âŠ