Web-scale profiling of semantic annotations in HTML pages
The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades
ago. The idea describes an extension of the existing Web in which “information
is given well-defined meaning, better enabling computers and people to work in
cooperation” [Berners-Lee et al., 2001].
Semantic annotations in HTML pages are one realization of this vision that
has been adopted by a large number of web sites in recent years. Semantic annotations
are integrated into the code of HTML pages using one of the three markup languages
Microformats, RDFa, or Microdata. Major consumers of semantic annotations are
the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic
annotations from crawled web pages to enrich the presentation of search results and
to complement their knowledge bases.
However, outside the large search engine companies, little is known about
the deployment of semantic annotations: How many web sites deploy semantic
annotations? What are the topics covered by semantic annotations? How detailed
are the annotations? Do web sites use semantic annotations correctly? Are semantic
annotations useful for parties other than the search engine companies? And how can
semantic annotations be gathered from the Web in that case?
The thesis answers these questions by profiling the web-wide deployment of
semantic annotations.
The topic is approached in three consecutive steps: In the first step, two approaches
for extracting semantic annotations from the Web are discussed. The thesis
first evaluates the technique of focused crawling for harvesting semantic annotations.
Afterward, a framework to extract semantic annotations from existing web crawl
corpora is described. The two extraction approaches are then compared for the
purpose of analyzing the deployment of semantic annotations in the Web.
In the second step, the thesis analyzes the overall and markup language-specific
adoption of semantic annotations. This empirical investigation is based on the largest
web corpus that is available to the public. Further, the topics covered by deployed
semantic annotations and their evolution over time are analyzed. Subsequent studies
examine common errors within semantic annotations. In addition, the thesis analyzes
the overlap among entities described by semantic annotations, both within the
same web site and across different web sites.
The third step narrows the focus of the analysis towards use case-specific
issues. Based on the requirements of a marketplace, a news aggregator, and a
travel portal, the thesis empirically examines the utility of semantic annotations for
these use cases. Additional experiments analyze the capability of product-related
semantic annotations to be integrated into an existing product categorization schema.
In particular, the potential of exploiting the diverse category information given by the
web sites providing semantic annotations is evaluated.