Abstract. The web has become a pig sty—everyone dumps information at random places and in random shapes. Try to find the cheapest apartment in Oxford considering rent, travel, tax, and heating costs; or a cheap, reasonably reviewed 11″ laptop with an SSD. Data extraction flushes structured information out of this sty: it turns mostly unstructured web pages into highly structured knowledge. In this chapter, we give a gentle introduction to data extraction, including pointers to existing systems. We start with an overview and classification of data extraction systems along two primary dimensions: the level of supervision and the scale considered. The rest of the chapter is organized along the major division of these approaches into site-specific and supervised versus domain-specific and unsupervised. We first discuss supervised data extraction, where a human user identifies, for each site, examples of the relevant data and the system generalizes these examples into extraction programs. We focus particularly on declarative and rule-based paradigms. In the second part, we turn to fully automated (or unsupervised) approaches, where the system identifies the relevant data by itself and fully automatically extracts data from many websites. Ontologies and schemata have proven invaluable in guiding unsupervised data extraction, and we present an overview of the existing approaches and the different ways in which they use ontologies.
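To make the notion of a rule-based extraction program (a "wrapper") concrete, the following is a minimal sketch in Python. The HTML snippet, class names, and rule format are invented for illustration; real wrapper systems generalize user-labeled examples into far richer rule languages (e.g., XPath-based ones), but the core idea of mapping structural rules to record fields is the same.

```python
# Minimal sketch of a rule-based extraction wrapper (illustrative only).
# The page content and the rule format {field: css_class} are assumptions
# made up for this example, not any particular system's rule language.
from html.parser import HTMLParser

PAGE = """
<div class="listing">
  <span class="title">UltraBook 11</span>
  <span class="price">499</span>
</div>
<div class="listing">
  <span class="title">NoteSlim 11</span>
  <span class="price">579</span>
</div>
"""

class WrapperParser(HTMLParser):
    """Applies simple rules of the form {field: css_class} to span elements."""
    def __init__(self, rules):
        super().__init__()
        self.rules = rules              # e.g. {"title": "title", "price": "price"}
        self.records, self.current = [], {}
        self.active_field = None        # field currently being captured

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "listing":
            self.current = {}           # a new record starts here
        for field, rule_cls in self.rules.items():
            if tag == "span" and cls == rule_cls:
                self.active_field = field

    def handle_data(self, data):
        if self.active_field and data.strip():
            self.current[self.active_field] = data.strip()
            self.active_field = None

    def handle_endtag(self, tag):
        if tag == "div" and self.current:
            self.records.append(self.current)  # record complete
            self.current = {}

parser = WrapperParser({"title": "title", "price": "price"})
parser.feed(PAGE)
print(parser.records)
```

In a supervised system, the dictionary of rules would not be hand-written as above but induced from a few examples the user labels on the page; the unsupervised approaches discussed later infer both the record boundaries and the field roles without such labels.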